Unpacking the reasoning in modern LLMs: why final answers are not enough
Recent reasoning-focused LLMs such as OpenAI's o1/o3 and DeepSeek-R1 have delivered notable improvements on complex tasks. However, the step-by-step reasoning behind these models remains unclear. Most evaluations focus on final-answer accuracy, which masks the reasoning process and does not reveal how models combine knowledge and logic. Some earlier methods try to measure reasoning by comparing answers back to the original question, but this approach is flawed because models often rely on prior deductions or internal knowledge rather than the question itself. Domains such as mathematics and medicine also differ in their reasoning requirements, highlighting the need for domain-aware evaluation methods to build trustworthy AI.
Gaps in final-answer evaluation in mathematics and medicine
Recent LLMs have made impressive progress on reasoning tasks, particularly in mathematics and medicine, thanks to better training data and reward strategies. However, most of this progress targets higher final-answer accuracy rather than an understanding of how the model reasons step by step. Previous work has either flagged factual errors in reasoning chains or measured the similarity between reasoning steps and the input question. But such similarity does not guarantee logical soundness or factual correctness, since LLMs often draw on internal knowledge or earlier deductions.
A new framework to separate knowledge and logic in LLM reasoning
Researchers from UC Santa Cruz, Stanford, and Tongji University go beyond final-answer evaluation by decomposing LLM reasoning into two key components: factual knowledge and logical steps. They introduce a fine-grained framework built on two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. Their analysis of Qwen models across mathematical and medical tasks reveals that reasoning skills do not transfer easily between domains. While supervised fine-tuning (SFT) improves accuracy, it often hurts reasoning depth. Reinforcement learning (RL), by contrast, helps refine reasoning by pruning irrelevant information. This work underscores the importance of evaluating, and training, LLMs more deliberately.
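As a rough sketch of how such a decomposition can be formalized (the notation below is an illustrative assumption, not necessarily the paper's exact definitions), InfoGain can be read as the drop in uncertainty about the ground-truth answer contributed by each reasoning step, and KI as the fraction of steps whose factual content checks out:

```latex
% Illustrative, assumed formalization -- not the authors' exact equations.
% r_1, ..., r_T: decomposed reasoning steps;  a: ground-truth answer;  H: uncertainty (entropy / NLL).
\mathrm{InfoGain}_t = H(a \mid r_1, \dots, r_{t-1}) - H(a \mid r_1, \dots, r_t)
\qquad
\mathrm{KI} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\left[\, r_t \text{ is factually correct} \,\right]
```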
Evaluating reasoning with Qwen2.5-7B and its DeepSeek-R1-distilled variant
The researchers evaluate reasoning in LLMs by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled version, both trained with SFT and RL. Using tasks from the mathematical and medical domains, they decompose responses into logical steps and assess them with two key measurements: Information Gain (how much uncertainty about the answer is reduced at each reasoning step) and the Knowledge Index (how factually accurate each step is, verified against expert sources). While InfoGain tracks how informative each step is, KI checks whether the knowledge in that step aligns with real-world facts. This approach reveals how models reason and where they may falter in accuracy or logic.
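To make the two measurements concrete, here is a minimal Python sketch under that reading: uncertainty is recorded as the model's negative log-likelihood of the gold answer after each reasoning step, and a verifier labels each step as factually correct or not. The function names and the example numbers are hypothetical, and the paper's actual implementation may differ.

```python
# Minimal, hypothetical sketch of the two metrics described above.
# Illustrative only -- not the authors' released code; exact definitions may differ.
from typing import List


def info_gain_per_step(answer_uncertainty: List[float]) -> List[float]:
    """Per-step information gain.

    answer_uncertainty[t] is the model's uncertainty about the ground-truth
    answer (e.g., its negative log-likelihood) after seeing reasoning steps
    1..t; the gain at step t is how much that uncertainty dropped vs. step t-1.
    """
    return [
        answer_uncertainty[t - 1] - answer_uncertainty[t]
        for t in range(1, len(answer_uncertainty))
    ]


def knowledge_index(step_is_correct: List[bool]) -> float:
    """Fraction of reasoning steps whose factual claims a verifier
    (e.g., expert sources or a grading model) marks as correct."""
    if not step_is_correct:
        return 0.0
    return sum(step_is_correct) / len(step_is_correct)


if __name__ == "__main__":
    # Hypothetical numbers: uncertainty about the gold answer before any
    # reasoning, then after each of four reasoning steps.
    uncertainties = [3.2, 2.4, 2.3, 1.0, 0.2]
    print("InfoGain per step:", info_gain_per_step(uncertainties))
    # Hypothetical verifier verdicts for the four reasoning steps.
    print("Knowledge Index:", knowledge_index([True, True, False, True]))
```

In this reading, a step can be highly informative (large InfoGain) yet factually wrong (dragging down KI), which is precisely the distinction the framework is built to expose.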
Supervised fine-tuning versus reinforcement learning on domain-specific tasks
The study evaluates two variants of Qwen2.5-7B, the base model and its R1-distilled counterpart, on medical tasks. The results show that base Qwen consistently outperforms the distilled model in accuracy, knowledge retention, and reasoning, particularly after SFT and RL. The distilled model likely struggles because its prior training focused on math and code, creating a domain mismatch. Interestingly, SFT improves medical knowledge more effectively than RL, though it can slightly compromise reasoning efficiency. RL, in turn, improves both reasoning and knowledge when applied after SFT. Medical benchmarks tend to lean more on factual knowledge than on abstract reasoning, unlike math-focused tasks.
Conclusion: towards more interpretable and trustworthy LLMs
In conclusion, the study introduces a framework that separates knowledge from reasoning to better assess how LLMs think, particularly in high-stakes domains such as medicine and mathematics. Using Qwen models trained with SFT and RL, the researchers found that while SFT improves the factual accuracy essential in medicine, it often weakens reasoning. RL, however, improves reasoning by pruning incorrect information. The framework could be extended to fields such as law or finance, where structured thinking is crucial. Overall, this approach helps clarify how LLMs make decisions and suggests ways to tailor their training to specific domains.
Check out the Paper, Code, and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 99K+ ML SubReddit and subscribe to our Newsletter.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
