Large language models (LLMs) have shown remarkable reasoning capabilities across a variety of tasks, with reinforcement learning (RL) serving as a crucial mechanism for refining their deep thinking abilities. While RL techniques have been particularly successful in mathematical reasoning and coding, domains with well-defined rules and verifiable correctness criteria, extending these approaches to broader reasoning contexts presents significant challenges, including limited training data and difficulty ensuring cross-domain generalization.
The evolution of reasoning in LLMs
The development of chain-of-thought (CoT) methodology has marked a significant advance in LLM reasoning capabilities. CoT prompting has demonstrated substantial improvements across mathematics, science, and programming domains by incorporating multi-step intermediate reasoning before reaching conclusions. This approach allows models to break down complex problems into manageable steps, mirroring human problem-solving processes.
While mathematical reasoning has dominated recent research due to its verifiable nature, the expansion of RL training to diverse domains remains largely unexplored. Prior work suggests that blending mathematical content with other verifiable domains can improve performance on broad reasoning benchmarks. However, a systematic investigation of how non-mathematical reasoning data, such as legal analysis, social science, or historical interpretation, affects the effectiveness of RL training still represents a significant research gap.
Challenges in diversifying reasoning domains
Recent research has explored methods for diversifying RL training datasets, yet questions about optimal data-blending strategies and the relative importance of different sources remain unanswered. A fundamental challenge in applying RL to general reasoning tasks is developing verifiable reward models for domains that lack deterministic solutions. Domain-specific reasoning processes, whether rule-based and symbolic in mathematics or contextual and heuristic in fields such as law and history, call for different cognitive approaches. In addition, question formats (open-ended versus multiple-choice) demand distinct reasoning strategies, suggesting that incorporating diverse reasoning domains could significantly improve the broad cognitive capabilities of LLMs.
Nemotron-CrossThink: a multi-domain approach
Researchers from NVIDIA, Carnegie Mellon University, and Boston University present Nemotron-CrossThink, a systematic framework for incorporating multi-domain corpora into RL training to improve cross-task generalization. The methodology follows a comprehensive pipeline that curates diverse data sources, including synthetic data from CommonCrawl and open-source question-answer pairs spanning STEM, humanities, law, and the social sciences. By applying templated formats (MCQ/open-ended) to constrain answer spaces, filtering samples for verifiable rewards, and implementing strategic data-blending recipes, the framework enables effective self-learning through RL across diverse reasoning domains.
Key results and innovations
Nemotron-CrossThink significantly improves LLM reasoning capabilities by integrating multi-domain data with varied question formats. Models trained with this approach demonstrate not only higher accuracy but also dynamic response strategies, generating concise answers for general-purpose questions while providing detailed responses for mathematical problems, thereby optimizing inference cost while maintaining task-specific accuracy.
The framework addresses the challenge of verifiable rewards in non-deterministic domains through templated data curation that limits answer-space diversity. It also provides an effective filtering approach that ranks general-purpose reasoning data by complexity, showing that training with more difficult samples amplifies the impact of RL across all domains. These innovations have led to substantial performance gains on both mathematical benchmarks (MATH-500: +30.1%, AMC23: +27.5%) and non-mathematical tasks (MMLU-Pro: +12.8%, GPQA-Diamond: +11.3%).
Comprehensive data curation
Nemotron-CrossThink begins with meticulous data curation from multiple sources to ensure diversity. The training dataset combines synthetically generated data with publicly available open-source QA datasets, encompassing both general-purpose reasoning and mathematical content. General-purpose reasoning data includes MMLU, Natural Reasoning, and synthesized QA pairs spanning STEM fields, economics, social sciences, and the humanities, while mathematical reasoning incorporates datasets such as MATH and NuminaMath alongside synthetically generated problems.
Template application and data filtering
To address the challenge of verifiable rewards in non-mathematical domains, the framework applies specific templates to structure question-answer pairs: multiple-choice questions (MCQ) and open-ended questions. This approach exposes the model to diverse answer formats and reasoning pathways while limiting answer-space variability to enable effective reward modeling. Rigorous filtering removes samples that are infeasible to evaluate with rule-based rewards, discarding MCQs whose correct answers are not among the listed choices and open-ended answers exceeding ten words.
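The snippet below is a minimal sketch of how such a rule-based verifiability filter could look. The sample schema (keys such as `format`, `choices`, `answer`) is purely illustrative and not the paper's actual data format.

```python
# Hypothetical sample schema for illustration only.
samples = [
    {"format": "mcq", "question": "…", "choices": ["A", "B", "C", "D"], "answer": "B"},
    {"format": "mcq", "question": "…", "choices": ["A", "B"], "answer": "E"},   # gold answer missing -> drop
    {"format": "open", "question": "…", "answer": "The Treaty of Westphalia"},  # short answer -> keep
    {"format": "open", "question": "…", "answer": " ".join(["word"] * 15)},     # too long to verify -> drop
]

def is_verifiable(sample: dict) -> bool:
    """Return True if a rule-based reward can reliably score this sample."""
    if sample["format"] == "mcq":
        # Discard MCQs whose gold answer is not among the listed choices.
        return sample["answer"] in sample["choices"]
    if sample["format"] == "open":
        # Discard open-ended answers longer than ten words.
        return len(sample["answer"].split()) <= 10
    return False

filtered = [s for s in samples if is_verifiable(s)]
print(len(filtered))  # 2 samples survive the filter
```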
Strategic data blending and reinforcement learning
Nemotron-CrossThink employs Group Relative Policy Optimization (GRPO) for reinforcement learning, which improves efficiency by using group-level score baselines rather than a separate critic model. The methodology investigates the impact of diverse data sources, question types, and data utility through six distinct blending recipes. This systematic approach enables detailed analysis of how general-purpose reasoning data complements mathematical reasoning, ultimately producing more adaptable and generalizable language models.
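For intuition, here is a minimal sketch of the group-relative advantage computation that lets GRPO skip a learned critic: each sampled response is compared against its own group's mean reward. The full GRPO objective also includes a clipped policy-ratio term and a KL penalty, which are omitted here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of sampled responses.

    The baseline is the group's own mean reward (no critic model);
    dividing by the group's std keeps advantage scales comparable
    across prompts of different difficulty.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    scale = rewards.std() + eps
    return (rewards - baseline) / scale

# Example: 4 sampled answers to one question, scored 1/0 by a rule-based verifier.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```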
Technical contributions
The research demonstrates several key technical advances in multi-domain reasoning through reinforcement learning:
- Templated question-answer formats provide more stable reward modeling, with unified open-ended question formats improving performance by 1.21% over mixed formats, and short-form answer templates outperforming long-form ones by 1.20%.
- Strategic data blending proves essential, with multi-domain corpora boosting average reasoning accuracy by 1.61% compared to math-only training while reducing token usage by 28%.
- Model-driven filtering techniques effectively select challenging samples by removing those solvable by smaller models, yielding an additional 2.15% accuracy gain for Qwen-2.5-32B (see the sketch after this list).
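The following sketch illustrates the idea of model-driven difficulty filtering under stated assumptions: `small_model_answer` is a hypothetical callable wrapping a smaller reference model's inference, and the exact-match check stands in for whatever answer verification the pipeline actually uses.

```python
from typing import Callable

def keep_hard_samples(samples: list[dict],
                      small_model_answer: Callable[[str], str],
                      n_tries: int = 4) -> list[dict]:
    """Drop samples a smaller reference model can already solve.

    Samples the small model answers correctly in any of `n_tries`
    attempts are treated as too easy and removed, so RL updates
    concentrate on harder problems.
    """
    hard = []
    for s in samples:
        solved = any(
            small_model_answer(s["question"]).strip().lower() == s["answer"].strip().lower()
            for _ in range(n_tries)
        )
        if not solved:
            hard.append(s)
    return hard

# Toy usage with a stub "small model" that only knows one answer.
stub = lambda q: "4" if q == "What is 2 + 2?" else "unsure"
data = [{"question": "What is 2 + 2?", "answer": "4"},
        {"question": "What is 17 * 23?", "answer": "391"}]
print(keep_hard_samples(data, stub))  # only the harder item remains
```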
These findings represent significant progress in developing LLMs with robust reasoning capabilities across diverse domains, moving beyond the traditional emphasis on mathematical reasoning to encompass the full spectrum of human knowledge and inference patterns.
Experiments and results
The experimental results demonstrate that different datasets significantly affect model performance across reasoning benchmarks. NuminaMath produced the highest overall average, surpassing the baseline by 8.30%, with particular strength in mathematical tasks while also generalizing well across diverse domains. Synthetic question-answer data improved performance by approximately 1.0%, showing strong accuracy on MMLU-Pro, AGIEval, and MATH-500, confirming that synthetically generated instruction-style data can generalize effectively when aligned with benchmark distributions.
Nemotron-CrossThink's approach consistently outperformed the base model across diverse blending strategies. The general-purpose reasoning blend (Bgpr↑) achieved the highest overall average, exceeding the baseline by approximately 5% on average and showing substantial gains on reasoning-focused benchmarks (+12.82% on MMLU-Pro, +15.12% on AGIEval). Although Bonly_math performed slightly better on strictly mathematical tasks, it lagged behind on non-mathematical reasoning benchmarks, demonstrating Bgpr↑'s superior versatility through strong cross-domain transfer.
Further analysis revealed that open-ended question formats (Bopen↑) yielded stronger results on mathematical benchmarks than multiple-choice formats (Bmcq↑), suggesting alignment with the inherently open-ended structure of mathematical problems. Mathematical reasoning data showed strong transferability to structured reasoning tasks, while general-purpose data proved less effective in isolation. This counterintuitive finding confirms that optimal general-purpose reasoning performance requires the inclusion of mathematical problems in training blends.
Conclusion
Nemotron-CrossThink introduces a scalable framework that improves LLM generalization through reinforcement learning with multi-domain corpora. By strategically blending diverse reasoning data at a 2:1 ratio of general-purpose to mathematical content, the approach achieves a remarkable 13.36% average improvement over baselines. The research shows that data diversity, not just volume, drives broader reasoning capabilities. Through difficulty-based filtering and thoughtful template design, Nemotron-CrossThink establishes a practical methodology for developing more generalizable, efficient, and reliable LLMs that extend self-learning beyond mathematical reasoning.
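As a closing illustration of the 2:1 blend mentioned above, here is a small sketch of composing such a mix. The function name, sampling budget, and dataset lists are illustrative assumptions, not the paper's actual recipe.

```python
import random

def blend_datasets(general: list, math: list, ratio=(2, 1), total=30_000, seed=0) -> list:
    """Compose a training mix at roughly 2 parts general-purpose reasoning
    to 1 part math. The `total` budget and ratio are illustrative values."""
    rng = random.Random(seed)
    n_general = total * ratio[0] // sum(ratio)
    n_math = total - n_general
    mix = (rng.sample(general, min(n_general, len(general)))
           + rng.sample(math, min(n_math, len(math))))
    rng.shuffle(mix)  # interleave domains so each RL batch sees both
    return mix
```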
Check out the Paper and Project Page.