Recent advances in reasoning-focused language models have marked a major shift in AI by scaling test-time computation. Reinforcement learning (RL) is crucial for developing reasoning capabilities and mitigating reward hacking pitfalls. However, a fundamental debate remains: whether RL provides genuinely new reasoning capabilities beyond those of a base model, or merely optimizes the sampling efficiency of solutions the base model could already produce. Current research faces two critical limitations: (a) a heavy reliance on specialized domains such as mathematics, where models are often overtrained and exploration potential is restricted, and (b) premature termination of RL training before models can fully develop new reasoning capabilities, with training typically limited to hundreds of steps.
Reasoning models are specialized AI systems that engage in detailed, long chain-of-thought processes before producing final answers. DeepSeek and Kimi have detailed methodologies for training reasoning models using reinforcement learning with verifiable rewards (RLVR), popularizing algorithms such as GRPO, Mirror Descent, and RLOO. Methods like AlphaGo and AlphaZero have shown that AI agents can improve their performance indefinitely, demonstrating that RL training can help agents develop novel techniques not present in their base models. However, some existing work questions whether RL training truly improves reasoning capability in LLMs, arguing that RLVR fails to extend reasoning capacity, as evidenced by pass@k metrics that show no improvement over base models.
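For context on that debate, pass@k measures whether at least one of k sampled completions solves a problem. The sketch below uses the widely cited unbiased estimator from the code-generation evaluation literature (not something specific to this paper), and the 16-sample numbers are purely illustrative:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled completions is correct, given c correct out of n samples."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: a problem where 4 of 16 sampled solutions pass the verifier
print(round(pass_at_k(n=16, c=4, k=1), 3))  # 0.25  -> pass@1
print(round(pass_at_k(n=16, c=4, k=8), 3))  # 0.962 -> pass@8
```

The skeptical argument is that if a base model already achieves high pass@k for large k, RL may only be concentrating probability mass on solutions it could already sample.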
NVIDIA researchers have proposed ProRL, a method designed to enable prolonged RL training periods, facilitating deeper exploration of reasoning strategies. ProRL supports more than 2,000 training steps and scales the training data across diverse tasks, including mathematics, coding, science problems, logic puzzles, and instruction following. Using ProRL, the researchers developed Nemotron-Research-Reasoning-Qwen-1.5B, the world's best 1.5B reasoning model, which outperforms its base model, DeepSeek-R1-1.5B, and surpasses DeepSeek-R1-7B across diverse benchmarks. It demonstrates that RL can discover genuinely new solution pathways not present in base models when given sufficient training time and applied to novel reasoning tasks, suggesting a real expansion of reasoning capabilities beyond the initial training.
The researchers built a diverse and verifiable training dataset spanning 136,000 examples across five task domains: mathematics, code, STEM, logic puzzles, and instruction following. Training uses the verl framework for the RL implementation, adopting enhancements to the GRPO method proposed by DAPO. A wide range of evaluation benchmarks across multiple domains is used to test the proposed model: mathematics evaluation includes AIME 2024, AIME 2025, AMC, MATH, Minerva Math, and OlympiadBench; coding evaluation uses the PRIME validation set, HumanEvalPlus, and LiveCodeBench; logic puzzle evaluation reserves 100 samples from Reasoning Gym tasks, while STEM reasoning and instruction-following capabilities are evaluated using curated subsets of GPQA Diamond and IFEval, respectively.
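For readers unfamiliar with GRPO, here is a minimal sketch of its core idea: group-relative advantages combined with a PPO-style clipped surrogate, with asymmetric "clip-higher" bounds in the spirit of DAPO. The tensor shapes, epsilon, and clip thresholds are illustrative assumptions, not the paper's verl configuration:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each rollout's verifiable reward is
    normalized against the other rollouts sampled for the same prompt.

    rewards: tensor of shape [num_prompts, group_size]
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_surrogate(logp_new, logp_old, adv, clip_low=0.2, clip_high=0.28):
    """PPO-style clipped objective; the asymmetric bounds mimic the
    'clip-higher' idea from DAPO (threshold values are assumptions)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * adv
    return -torch.min(unclipped, clipped).mean()
```

In practice, each rollout's advantage is broadcast over its tokens and combined with additional regularization terms; those details are omitted in this sketch.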
In mathematics, Nemotron-Research-Reasoning-Qwen-1.5B achieves an average improvement of 15.7% across benchmarks, while competitive programming tasks show a 14.4% improvement in pass@1 accuracy. STEM reasoning and instruction following yield gains of 25.9% on GPQA Diamond and 22.0% on IFEval. The model also shows a 54.8% improvement in reward, reflecting high accuracy on Reasoning Gym logic puzzles. Out-of-distribution evaluation reveals significant improvements on three unseen Reasoning Gym tasks, highlighting effective generalization beyond the training distribution. Compared to the domain-specialized models DeepScaleR-1.5B and DeepCoder-1.5B, the ProRL-trained model achieves higher pass@1 scores on math (+4.6%) and code (+6.5%) benchmarks.
In this paper, the researchers introduced ProRL, which provides evidence that extended, stable RL training develops novel reasoning patterns beyond a base model's initial capabilities. Based on this method, they developed Nemotron-Research-Reasoning-Qwen-1.5B, the world's best 1.5B reasoning model. ProRL demonstrates the ability to solve tasks on which base models initially struggle, showing that extended RL training helps models internalize abstract reasoning patterns that transfer beyond the training distributions. These results challenge previous assumptions about RL's limitations and establish that sufficient training time with appropriate techniques can push the boundaries of reasoning, paving the way for more capable reasoning models.
Check out the Paper and Model Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he explores the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
