Limits of Reinforcement Learning on Narrow Reasoning Domains
Reinforcement learning (RL) has shown great potential for improving the reasoning capabilities of LLMs, most visibly in leading systems such as OpenAI's o3 and DeepSeek-R1. However, most RL research has concentrated narrowly on mathematics and code, which limits its general applicability. This narrow scope poses two problems: our understanding of how RL improves reasoning may not generalize beyond these domains, and the resulting models often lack versatility. Extending RL to broader reasoning tasks is difficult because reliable reward signals and curated datasets, which are straightforward to define for math and code, are much harder to construct for open-ended reasoning domains.
Narrow-Domain Focus and Generalization Challenges
Reinforcement learning has become a popular method for improving LLM reasoning skills, especially following the success of models such as OpenAI's o3 and DeepSeek-R1. Many open-source efforts have followed, focusing mainly on mathematics and coding. Although these models perform well in their niches, their reasoning does not always generalize to broader tasks. In parallel, research has explored how RL shapes reasoning: some studies suggest that RL does not teach new skills but rather amplifies the model's ability to access reasoning patterns it already possesses, while more recent work indicates that prolonged RL training can unlock genuinely new reasoning strategies.
Introducing the Guru Dataset: A Multi-Domain RL Benchmark
Researchers from UC San Diego, MBZUAI, Carnegie Mellon, and Purdue introduce Guru, a 92K-example RL dataset covering six reasoning domains: math, code, science, logic, simulation, and tabular. Each domain is carefully built with tailored reward functions and rigorous filtering. Training models on Guru reveals that RL outcomes depend strongly on domain familiarity: domains common in pretraining benefit from cross-domain RL, while unfamiliar ones require in-domain training to improve significantly. Their models, Guru-7B and Guru-32B, outperform prior open models by up to 7.9% across 17 tasks. These results highlight the domain-specific effects of RL and the value of broad, multi-domain reasoning benchmarks.
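The article does not reproduce Guru's reward code, but a "tailored, verifiable reward function" for the math domain might look like the following minimal sketch. The function names (`extract_final_answer`, `math_reward`) and the assumption that answers appear in `\boxed{}` or as the last number are illustrative, not taken from the paper:

```python
import re
from typing import Optional

def extract_final_answer(text: str) -> Optional[str]:
    """Hypothetical extractor: prefer the last \\boxed{...} expression,
    otherwise fall back to the last numeric literal in the response."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def math_reward(response: str, gold: str) -> float:
    """Rule-based binary reward: 1.0 if the extracted final answer
    exactly matches the gold answer, else 0.0."""
    ans = extract_final_answer(response)
    return 1.0 if ans is not None and ans == gold.strip() else 0.0
```

Other domains (e.g., code) would swap in their own verifiers, such as executing unit tests, while keeping the same binary-reward interface.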
Cross-Domain vs. In-Domain Reinforcement Learning Effects
To better understand how RL supports reasoning across domains, the researchers trained models on individual-domain subsets and on the mixed-domain Guru dataset. They found that domains such as math, code, and science benefited more from cross-domain RL, likely because of their stronger presence in pretraining. Mixed-domain training also matched or exceeded single-domain training, showing that combining diverse tasks can improve general reasoning. However, training only on harder examples improved performance in that domain while reducing accuracy on simpler tasks in others. These findings suggest that data diversity and balanced difficulty are essential for effective, transferable reasoning skills.
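The exact mixing strategy is not specified in the article; one simple way to realize "mixed-domain training" is to sample each batch element from a domain chosen in proportion to dataset size, so all six domains stay represented. This is an assumed sketch, not the authors' pipeline:

```python
import random

def mixed_domain_batches(datasets, batch_size, seed=0):
    """Hypothetical mixed-domain sampler: yields batches whose examples
    are drawn from each domain with probability proportional to that
    domain's dataset size."""
    rng = random.Random(seed)
    names = list(datasets)
    weights = [len(datasets[d]) for d in names]
    while True:
        batch = []
        for _ in range(batch_size):
            domain = rng.choices(names, weights=weights, k=1)[0]
            batch.append((domain, rng.choice(datasets[domain])))
        yield batch
```

A curriculum variant could instead reweight toward harder examples, though as the findings above note, over-weighting difficulty can hurt accuracy on simpler tasks.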
Guru Model Architecture and Evaluation Strategy
The study trained 7B and 32B models on the Guru dataset to explore how combining multiple domains during RL improves reasoning capabilities. Using the verl framework and the GRPO algorithm, the models were evaluated on a broad range of tasks, including math, code, logic, science, simulation, and tabular, with consistent metrics. The results showed that the Guru models outperformed domain-specific baselines and performed well on unseen tasks. Notably, Pass@k analysis revealed that performance depends on task type, model size, and decoding parameters. Larger models benefited more from RL, and tuning sampling parameters such as temperature and top-p helped improve output diversity and reasoning coverage.
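For readers unfamiliar with Pass@k, it is typically computed with the standard unbiased estimator (Chen et al., 2021): given n sampled generations of which c pass, it estimates the probability that at least one of k samples is correct. The article does not show this formula; the sketch below is the standard estimator, not code from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than k samples: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Raising sampling temperature or top-p increases sample diversity, which tends to raise Pass@k at large k even when Pass@1 drops, which is consistent with the decoding-parameter effects reported above.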

Summary: General-Purpose Reasoning with Guru
In conclusion, Guru is a curated RL dataset of 92,000 high-quality, verifiable examples across six reasoning domains: math, code, science, logic, simulation, and tabular. Unlike prior RL research, which focused mainly on math and code, Guru enables broader reasoning studies by providing domain-specific reward signals. The researchers trained two models, Guru-7B and Guru-32B, which achieve state-of-the-art results on 17 benchmark tasks, particularly in domains underrepresented during pretraining. Their findings show that RL can both refine existing knowledge and foster new reasoning abilities. All data, models, and code are publicly released to support further general-purpose reasoning research.
Check out the Paper, Project page, and GitHub page. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, Sana brings a fresh perspective to the intersection of AI and real-life solutions.
