Selecting High-Entropy Tokens in Reinforcement Learning with Verifiable Rewards (RLVR) Improves Accuracy and Cuts Training Costs for LLMs

by Brenden Burgess


Large language models (LLMs) generate responses step by step in what are known as chains-of-thought (CoTs), where each token contributes to a coherent, logical narrative. To improve the quality of reasoning, various reinforcement learning techniques have been used. These methods allow the model to learn from feedback by aligning generated outputs with correctness criteria. As LLMs grow in complexity and capability, researchers have begun to probe the internal structure of token generation to discern patterns that improve or limit performance. One area attracting attention is the distribution of token entropy, a measure of uncertainty in token prediction, which has now been linked to a model's ability to make meaningful logical decisions during reasoning.

A core problem in training reasoning models with reinforcement learning is treating all output tokens alike. When models are optimized using reinforcement learning with verifiable rewards (RLVR), the update process traditionally covers every token in the generated sequence, regardless of its functional role. This uniform treatment fails to distinguish tokens that drive meaningful shifts in reasoning from those that merely extend existing linguistic structures. As a result, a large share of the training budget may be spent on tokens that contribute minimally to the model's reasoning abilities. Without prioritizing the few tokens that play a decisive role in navigating between different logic paths, these methods miss opportunities for targeted, efficient optimization.

Most RLVR frameworks, including Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Dynamic sAmpling Policy Optimization (DAPO), operate by evaluating entire sequences of output tokens against reward functions that assess correctness. PPO relies on stabilizing policy updates through a clipped objective function. GRPO improves on this by estimating advantage values from groups of sampled responses rather than from a separate value network. DAPO introduces further refinements, such as the clip-higher mechanism and overlong reward shaping. None of these methods, however, considers token-level entropy or distinguishes the importance of individual tokens in the reasoning chain; instead, they apply uniform gradient updates across the board.
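To make the group-based advantage idea concrete, here is a minimal sketch of how a GRPO-style update might score each sampled response against its own group, normalizing verifiable rewards by the group mean and standard deviation instead of using a learned critic. Function and variable names are illustrative, not taken from any particular codebase.

```python
# Sketch of GRPO-style group-relative advantage estimation (illustrative only).
# Assumes rewards come from a verifiable checker (e.g., exact-match on a math
# answer), so no value network is needed.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,) -- one scalar reward per response sampled from the
    same prompt. Returns one advantage per response, normalized against the
    group's own mean and standard deviation."""
    mean = rewards.mean()
    std = rewards.std(unbiased=False)
    return (rewards - mean) / (std + eps)

# Example: 4 sampled responses to one prompt, graded 1.0 (correct) / 0.0 (wrong).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # positive for correct, negative for wrong
```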

In an effort to refine how RLVR training affects LLM reasoning, researchers from Alibaba Inc. and Tsinghua University introduced a new methodology focused on token entropy patterns. They observed that in the CoT sequences generated by Qwen3 models, a small subset of tokens, roughly 20%, shows significantly higher entropy. These tokens, labeled "forking tokens," often correspond to moments when the model must decide between several reasoning paths. The remaining 80% of tokens generally show low entropy and act as extensions of prior statements. By restricting policy-gradient updates to these high-entropy tokens alone, the research team not only maintained but, in many cases, improved performance on challenging reasoning benchmarks.
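The core idea can be illustrated with a short sketch of how updates might be restricted to the highest-entropy fraction of tokens. This is a simplified REINFORCE-style surrogate, not the authors' exact objective, and it applies the 20% cutoff per sequence rather than per batch for brevity; all names are hypothetical.

```python
# Sketch: zero out the policy-gradient contribution of low-entropy tokens so
# that only the top-20% "forking" tokens drive the update.
import torch

def high_entropy_mask(token_entropy: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """token_entropy: (seq_len,) entropy of the sampling distribution at each step.
    Returns a boolean mask keeping only the highest-entropy fraction of tokens."""
    k = max(1, int(keep_ratio * token_entropy.numel()))
    threshold = torch.topk(token_entropy, k).values.min()
    return token_entropy >= threshold

def masked_pg_loss(logprobs, advantage, token_entropy, keep_ratio=0.2):
    """REINFORCE-style surrogate restricted to forking tokens.
    logprobs: (seq_len,) log-probabilities of the sampled tokens.
    advantage: scalar advantage for the whole response (broadcast over tokens)."""
    mask = high_entropy_mask(token_entropy, keep_ratio).float()
    per_token = -(logprobs * advantage)            # low-entropy tokens get no gradient
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```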

To quantify token entropy, the researchers applied the standard entropy formula to the probability distribution over possible token choices at each generation step. They found that more than half of all generated tokens had entropy values below 0.01, indicating nearly deterministic behavior. Only 20% exceeded an entropy of 0.672, marking them as the decision-making hubs within CoTs. High-entropy tokens often include logical operators and connective words such as "suppose," "since," or "thus," which introduce new conditions or transitions in the logic. In contrast, low-entropy tokens include symbols, suffixes, or predictable code fragments. Through controlled experiments, it became clear that manipulating the entropy of these forking tokens directly influenced the model's reasoning performance, whereas modifying low-entropy tokens had little effect.
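For reference, the per-token entropy in question is H_t = -Σ_v p_t(v) log p_t(v), computed over the model's vocabulary distribution at step t. The snippet below is a minimal sketch assuming PyTorch logits of shape (seq_len, vocab_size); it is illustrative rather than the authors' implementation.

```python
# Sketch of the per-token entropy computation described above.
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) pre-softmax scores for each generated step.
    Returns (seq_len,) entropies in nats."""
    logp = F.log_softmax(logits, dim=-1)
    p = logp.exp()
    return -(p * logp).sum(dim=-1)

# Tokens with entropy near 0 are almost deterministic continuations; the
# roughly 20% above the reported 0.672 cutoff act as "forking" tokens.
```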

The research team conducted extensive experiments across three model sizes: Qwen3-8B, Qwen3-14B, and Qwen3-32B. When trained on only the top 20% high-entropy tokens, the Qwen3-32B model achieved a score of 63.5 on AIME'24 and 56.7 on AIME'25, both setting new performance benchmarks for models under 600B parameters. Furthermore, increasing the maximum response length from 20K to 29K raised the AIME'24 score to 68.1. In comparison, training on the bottom 80% of low-entropy tokens caused a significant drop in performance. The Qwen3-14B model showed gains of +4.79 on AIME'25 and +5.21 on AIME'24, while Qwen3-8B maintained results competitive with full-token training. An ablation study further confirmed the importance of keeping the 20% threshold: reducing the fraction to 10% omitted essential decision points, while increasing it to 50% or 100% diluted the effect by including too many low-entropy tokens, reducing entropy diversity and hindering exploration.

In essence, the research offers a new direction for improving the reasoning abilities of language models by identifying and selectively training on the minority of tokens that contribute disproportionately to reasoning success. It avoids inefficient training and instead proposes a scalable approach that aligns reinforcement learning objectives with the actual decision-making moments in token sequences. The success of this strategy lies in using entropy as a guide to distinguish useful tokens from filler.

Key takeaways from the research include:

  • About 20% of tokens show high entropy and serve as forking points that direct reasoning paths.
  • Training only on these high-entropy tokens delivers performance equal to or better than training on the full token set.
  • Qwen3-32B achieved scores of 63.5 on AIME'24 and 56.7 on AIME'25, surpassing larger, conventionally trained models.
  • Extending the maximum response length from 20K to 29K further pushed the AIME'24 score to 68.1.
  • Training on the remaining 80% of low-entropy tokens led to a sharp degradation in performance.
  • Keeping the 20% threshold for high-entropy tokens optimally balances exploration and performance.
  • Larger models benefit more from this strategy because of their greater capacity to exploit improved exploration.
  • The strategy scales well and could guide more efficient training of next-generation reasoning models.

In conclusion, this research effectively rethinks how reinforcement learning is applied to language models by introducing a token-level focus on entropy. By optimizing only the minority of tokens that influence reasoning paths, the method improves performance while reducing overall computation costs. It provides a practical roadmap for future efforts to improve reasoning in LLMs without unnecessary complexity.


Check out the Paper. All credit for this research goes to the researchers of this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in material science, he explores new advancements and creates opportunities to contribute.
