Recent advances in reasoning-centric large language models (LLMs) have broadened the scope of reinforcement learning (RL) beyond narrow, task-specific applications, enabling stronger generalization and reasoning capabilities. However, this shift introduces significant challenges, particularly in scaling the training compute required to learn from experience. Unlike imitation learning through pre-training and fine-tuning, RL demands a more compute-intensive approach. A central problem is the collapse of policy entropy, which governs the balance between exploiting known strategies and exploring new ones. This exploration-exploitation trade-off is fundamental to RL, and controlling policy entropy has become essential for maintaining effective exploration during training.
Existing efforts address the exploration-exploitation trade-off in RL through policy entropy. Maximum-entropy RL adds an entropy regularization term to the reward, promoting uncertainty in action selection and encouraging broader exploration. Although this technique is widely adopted in conventional RL algorithms, its usefulness for LLMs remains debated. Moreover, predictability in RL for LLMs is underexplored: while neural scaling laws underpin LLM development, comparable predictive laws for RL training remain limited. Existing RL methods for LLMs with verifiable rewards show promising reasoning improvements, but lack a deep understanding of their underlying mechanisms.
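For intuition, the sketch below shows how an entropy bonus is typically folded into a policy-gradient loss in maximum-entropy RL. It is a minimal, generic illustration rather than code from the paper; the coefficient `beta` and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_pg_loss(logits, actions, advantages, beta=0.01):
    """Policy-gradient loss with a maximum-entropy bonus (illustrative sketch).

    logits:     (batch, vocab) raw policy outputs
    actions:    (batch,) sampled token ids
    advantages: (batch,) advantage estimates
    beta:       entropy-bonus coefficient (hypothetical value)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    act_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Standard policy-gradient term: maximize advantage-weighted log-probability.
    pg_loss = -(advantages * act_log_probs).mean()

    # Policy entropy at each step; subtracting it from the loss rewards
    # uncertainty in action selection and thus encourages exploration.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    return pg_loss - beta * entropy
```

A larger `beta` keeps the policy more uncertain for longer, at the cost of slower exploitation of high-reward actions.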
Researchers from Shanghai AI Laboratory, Tsinghua University, UIUC, Peking University, Nanjing University, and CUHK propose an approach to combat policy entropy collapse in RL for reasoning-centric LLMs. They establish a transformation equation, R = −a·exp(H) + b, where H is the policy entropy, R is downstream performance, and a and b are fitted coefficients. This empirical law strongly suggests that policy performance is traded off against policy entropy and is therefore bottlenecked by its exhaustion. The researchers also study entropy dynamics, and their derivation shows that the change in policy entropy is driven by the covariance between the action probability and the change in logits. Building on this, they propose two techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance.
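The sketch below is a hedged illustration of the Clip-Cov / KL-Cov idea as described above, not the authors' exact implementation: tokens whose covariance signal is unusually high either have their gradient contribution clipped out (Clip-Cov) or receive an extra KL-style penalty toward a reference policy (KL-Cov). The covariance proxy, the quantile threshold `tau`, and the penalty weight `kl_coef` are all illustrative assumptions.

```python
import torch

def covariance_regularized_loss(log_probs, advantages, ref_log_probs,
                                mode="kl_cov", tau=0.9, kl_coef=0.1):
    """Hedged sketch of Clip-Cov / KL-Cov style regularization.

    log_probs:     (num_tokens,) log-probabilities of the sampled tokens
    advantages:    (num_tokens,) per-token advantage estimates
    ref_log_probs: (num_tokens,) log-probs under a reference policy (KL-Cov only)
    mode:          "clip_cov" or "kl_cov"
    tau:           quantile above which a token counts as high-covariance
    kl_coef:       KL penalty weight (illustrative value)
    """
    # Per-token proxy for the covariance term that drives entropy decay:
    # centered log-probability times centered advantage.
    cov = (log_probs - log_probs.mean()) * (advantages - advantages.mean())
    high_cov = cov > torch.quantile(cov, tau)

    pg_loss = -(advantages * log_probs)

    if mode == "clip_cov":
        # Clip-Cov: drop the gradient contribution of high-covariance tokens.
        pg_loss = torch.where(high_cov, pg_loss.detach(), pg_loss)
        return pg_loss.mean()

    # KL-Cov: keep all tokens but penalize high-covariance ones toward the
    # reference policy with a token-level KL-style term.
    kl_term = log_probs - ref_log_probs
    penalty = torch.where(high_cov, kl_coef * kl_term, torch.zeros_like(kl_term))
    return (pg_loss + penalty).mean()
```

Either variant removes or dampens the updates most responsible for collapsing entropy, which is how the methods keep exploration alive during training.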
To study and validate the entropy collapse phenomenon in RL for LLMs, the researchers applied RL to LLMs on verifiable tasks, such as mathematics and coding, using an autoregressive generation setup in which models produce sequences token by token. The study covers 11 widely adopted open-source models spanning four families: Qwen2.5, Mistral, LLaMA, and DeepSeek, with sizes ranging from 0.5B to 32B parameters. Evaluations are carried out on eight public benchmarks, including MATH500, AIME 2025, AMC, and Eurus-2-RL. RL training follows the veRL framework in a zero-shot setting, using algorithms such as GRPO, REINFORCE++, and PRIME to optimize policy performance while tracking entropy dynamics.
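Tracking entropy dynamics during such training requires measuring the policy's average token-level entropy over generated responses. A minimal measurement sketch (not the paper's logging code) for any causal LM that returns logits might look like this:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_policy_entropy(model, input_ids, attention_mask):
    """Average token-level entropy of the policy over a batch of sequences.

    Generic measurement sketch: `model` is assumed to be a causal LM whose
    forward pass returns logits of shape (batch, seq_len, vocab).
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)

    # Average only over real (non-padding) positions.
    mask = attention_mask.float()
    return (token_entropy * mask).sum() / mask.sum()
```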
The proposed Clip-Cov and KL-Cov techniques were evaluated on Qwen2.5 models using the DAPO-MATH dataset for mathematical tasks. These methods achieve non-trivial performance gains across all benchmarks. Compared to the GRPO baseline, they improve performance by 2.0% on average for the 7B model and 6.4% for the 32B model. For example, when the baseline's entropy plateaus, the KL-Cov method still maintains an entropy level more than 10 times higher, and both methods sustain higher entropy throughout training. Moreover, the gains are more substantial on the Qwen2.5-32B model, with improvements of 15.0% and 14.6% over GRPO on the most challenging benchmarks, AIME24 and AIME25, respectively.
In conclusion, the researchers tackle the challenge of policy entropy collapse in RL for reasoning-centric LLMs. The results highlight a trade-off between improving performance and preserving exploration, which ultimately limits further gains. Through theoretical analysis and empirical validation, they identify entropy dynamics as a key bottleneck and propose two effective regularization strategies, Clip-Cov and KL-Cov, to manage high-covariance tokens and sustain exploration. As RL emerges as a crucial axis for scaling beyond pre-training, addressing entropy collapse becomes essential. This work provides fundamental insights into the role of entropy, guiding future efforts to scale RL toward smarter and more capable language models.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95K+ ML SubReddit and subscribe to our newsletter.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
