Introduction: Reinforcement Learning Progress through Chain-of-Thought Prompting
LLMs have shown excellent progress on complex reasoning tasks thanks to chain-of-thought (CoT) prompting combined with large-scale reinforcement learning (RL). Models such as DeepSeek-R1-Zero have demonstrated strong reasoning capabilities by applying RL directly to base models. Likewise, methods such as SimpleRL and Open-Reasoner-Zero show improvements in smaller models like the Qwen series. However, achieving success across different families of base models remains a challenge. Moreover, applying R1-Zero-style training to base models such as the Llama series runs into difficulties, raising a fundamental question about the underlying factors that lead different base models to behave inconsistently during reinforcement learning.
Limitations of RL Scaling on Llama Models
Large-scale RL advances in models such as OpenAI's o1 and o3 and DeepSeek's R1 on competition-level math problems have motivated the exploration of RL on smaller models with fewer than 100 billion parameters. However, these efforts remain largely limited to the Qwen model family, while replicating the results on families such as Llama is difficult. The lack of transparency in pre-training pipelines has made it hard to understand how pre-training influences RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects such as OpenWebMath, MathPile, InfiMM-WebMath, and FineMath have made progress but remain limited in scale to under 100B tokens.
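To make the prompting difference those studies probe concrete, a one-shot prompt simply prepends a single worked example before the target question. The sketch below builds such a prompt; the demonstration problem, wording, and formatting are illustrative assumptions, not the exact prompts used in those studies.

```python
# Minimal sketch of one-shot (single worked example) prompting,
# the setting in which Qwen reportedly benefits more than Llama.
# The demonstration problem and formatting are illustrative assumptions.

ONE_SHOT_DEMO = (
    "Question: Natalia sold clips to 48 friends in April, and then she sold "
    "half as many clips in May. How many clips did she sell altogether?\n"
    "Answer: In May she sold 48 / 2 = 24 clips, so in total 48 + 24 = 72. "
    "The answer is 72.\n\n"
)


def build_one_shot_prompt(question: str) -> str:
    """Prepend a single worked example (one-shot) before the target question."""
    return ONE_SHOT_DEMO + f"Question: {question}\nAnswer:"


if __name__ == "__main__":
    print(build_one_shot_prompt("What is 15% of 240?"))
```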

Exploring Mid-Training with a Stable-then-Decay Strategy
Researchers from Shanghai Jiao Tong University study how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study offers several insights: first, high-quality mathematical corpora such as MegaMath-Web-Pro boost both the base model and RL outcomes. Second, using QA-style data, especially data with long CoT reasoning, further improves RL results. Third, long CoT introduces verbosity and instability into RL training. Finally, applying scaling during mid-training leads to stronger downstream RL performance. The researchers introduce a two-stage mid-training strategy called Stable-then-Decay, in which base models are first trained on 200B tokens, followed by 20B tokens across three CoT-focused branches, resulting in OctoThinker models with high RL compatibility.
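The stable-then-decay idea can be pictured as a constant learning rate for the 200B-token stage followed by a decay phase over each 20B-token branch. The sketch below illustrates such a schedule; the peak and minimum learning rates and the cosine decay shape are assumptions for illustration, not the paper's exact hyperparameters.

```python
# Minimal sketch of a stable-then-decay learning-rate schedule.
# The constants (peak LR, floor LR, cosine decay) are illustrative
# assumptions, not the exact hyperparameters used for OctoThinker.
import math

STABLE_TOKENS = 200e9   # stage 1: ~200B tokens at a constant LR
DECAY_TOKENS = 20e9     # stage 2: ~20B tokens per branch with decaying LR
PEAK_LR = 3e-5          # assumed peak learning rate
MIN_LR = 3e-6           # assumed floor for the decay stage


def stable_then_decay_lr(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR  # stage 1: hold the LR constant
    # stage 2: cosine-decay the LR over the branch-specific token budget
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for t in (50e9, 200e9, 205e9, 210e9, 220e9):
        print(f"{t / 1e9:>5.0f}B tokens -> lr = {stable_then_decay_lr(t):.2e}")
```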
RL Configuration and Benchmark Evaluation
The researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per prompt, and a PPO mini-batch size of 64, with experiments carried out on the Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models and zero-shot prompting for RL-tuned models across indicator tasks, including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models show increasing response lengths that remain reasonable throughout, while Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. The evaluation further reveals that RL-tuned Qwen2.5-3B achieves improvements across benchmarks, while Llama-3.2-3B shows only marginal gains.
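For concreteness, the reported RL settings can be collected into a single configuration object. The sketch below uses a plain dataclass with hypothetical field names; only the numeric values (global batch 128, 16 rollouts per prompt, PPO mini-batch 64, 4,096-token generation cap) and the model and benchmark names come from the setup described above.

```python
# Sketch of the reported RL setup as a plain configuration object.
# Field names are hypothetical; the numbers and model/benchmark names
# follow the setup described in the text.
from dataclasses import dataclass


@dataclass
class RLConfig:
    base_model: str                      # "Llama-3.2-3B-Base" or "Qwen2.5-3B-Base"
    prompt_dataset: str = "MATH8K"       # prompts used for RL training
    global_batch_size: int = 128         # prompts per training step
    rollouts_per_prompt: int = 16        # sampled responses per prompt
    ppo_mini_batch_size: int = 64        # PPO update mini-batch size
    max_response_tokens: int = 4096      # generation cap during rollouts
    eval_benchmarks: tuple = ("GSM8K", "MATH500", "OlympiadBench", "AMC23")


if __name__ == "__main__":
    for name in ("Llama-3.2-3B-Base", "Qwen2.5-3B-Base"):
        cfg = RLConfig(base_model=name)
        print(cfg.base_model, cfg.global_batch_size, cfg.rollouts_per_prompt)
```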
OctoThinker Surpasses Llama in RL Compatibility
Each OctoThinker branch demonstrates a 10% to 20% improvement over the original Llama base model and consistent gains over the stable-stage model across all sizes when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. When comparing three 3B-scale base models during RL training, OctoThinker-Long-3B outperforms the original Llama-3.2-3B model and reaches performance parity with Qwen2.5-3B, a model known for strong reasoning capabilities and extensive pre-training. The hybrid and short branches show slightly lower performance, especially on challenging benchmarks.
Conclusion and Future Work: Toward RL-Ready Foundation Models
This paper examines why base models such as Llama and Qwen behave differently during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited to RL, resulting in the OctoThinker models. Future research directions include:
- Curating higher-quality mathematical corpora to improve mid-training.
- Creating RL-friendly base models using open recipes without distillation from long-CoT reasoning models.
- Separating the QA format from the content to understand their individual contributions.
- Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.
Check out the Paper, Hugging Face page, and GitHub page. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
