Do Incorrect Answers Improve Math Reasoning? Reinforcement Learning with Verifiable Rewards (RLVR) Delivers Surprising Results with Qwen2.5-Math

by Brenden Burgess


In natural language processing (NLP), reinforcement learning (RL) methods such as reinforcement learning from human feedback (RLHF) have been used to improve model outputs by optimizing responses according to feedback signals. A specific variant, Reinforcement Learning with Verifiable Rewards (RLVR), extends this approach by using automatic signals, such as mathematical correctness or syntactic features, as feedback, enabling large-scale tuning of language models. RLVR is particularly interesting because it promises to improve models' reasoning abilities without requiring extensive human supervision. This intersection of automated feedback and reasoning tasks is an exciting area of research, where developers aim to discover how models can learn to reason mathematically, logically, or structurally with limited supervision.
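To make the idea concrete, here is a minimal sketch of what a verifiable reward for mathematical answers might look like, assuming responses end with a LaTeX-style \boxed{} expression; the function names and the exact-match rule are illustrative assumptions, not taken from any specific RLVR implementation.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the contents of the last \\boxed{...} expression, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, reference_answer: str) -> float:
    """Reward 1.0 when the model's final boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed_answer(response)
    return 1.0 if predicted is not None and predicted == reference_answer else 0.0

# A correct final answer earns a reward of 1.0; anything else earns 0.0.
print(verifiable_reward(r"The sum is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"The sum is \boxed{41}", "42"))  # 0.0
```

In an RLVR training loop, a scalar signal like this stands in for human preference feedback as the quantity the policy is optimized against.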

A persistent challenge in machine learning is building models that can reason effectively under minimal or imperfect supervision. In tasks such as mathematical problem solving, where the correct answer may not be immediately available, researchers study how to guide a model's learning. Models typically learn from ground-truth labels, but it is impractical to label large datasets with perfect accuracy, particularly for reasoning tasks that require an understanding of complex structures such as proofs or programmatic steps. Consequently, an open question is whether models can learn to reason when exposed to noisy, misleading, or even incorrect signals during training. This question matters because models that rely too heavily on perfect feedback may not generalize when such supervision is unavailable, limiting their usefulness in real-world scenarios.

Several existing techniques aim to improve models' reasoning abilities through reinforcement learning (RL), with RLVR being a key focus. Traditionally, RLVR has used "ground-truth" labels, correct answers verified by humans or automated tools, to provide rewards during training. Some approaches relax this requirement by using majority-vote labels or simple format-based heuristics, such as rewarding responses that follow a specific output style. Other methods have experimented with random rewards, offering positive signals without regard to whether the answer is correct. These methods explore whether models can learn even with minimal guidance, but they have focused mainly on specific model families, such as Qwen, raising concerns about generalization across different architectures.
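The sketch below illustrates, under loose assumptions, how the reward variants compared in this line of work (ground-truth, majority-vote, format-based, random, and incorrect rewards) could be expressed as simple scoring functions; the helper names and the 0/1 reward scheme are illustrative and not taken from the paper's code.

```python
import random
import re
from collections import Counter

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def final_answer(response: str) -> str | None:
    """Extract the last boxed expression from a response, if present."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None

def ground_truth_reward(response: str, gold: str) -> float:
    # Reward only answers that match a verified correct label.
    return float(final_answer(response) == gold)

def majority_vote_reward(response: str, sampled_responses: list[str]) -> float:
    # Use the most common answer among the model's own samples as a pseudo-label.
    answers = [a for a in map(final_answer, sampled_responses) if a is not None]
    if not answers:
        return 0.0
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return float(final_answer(response) == pseudo_label)

def format_reward(response: str) -> float:
    # Reward any response containing a boxed expression, regardless of correctness.
    return float(final_answer(response) is not None)

def random_reward(response: str, p: float = 0.5) -> float:
    # Reward with fixed probability, ignoring the response content entirely.
    return float(random.random() < p)

def incorrect_reward(response: str, wrong_label: str) -> float:
    # Reward agreement with a deliberately incorrect label.
    return float(final_answer(response) == wrong_label)
```

Only the first function requires verified answers; the others trade correctness for progressively weaker (or entirely spurious) supervision, which is exactly the spectrum the study probes.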

Researchers from the University of Washington, the Allen Institute for AI, and UC Berkeley investigated this question by testing various reward signals on Qwen2.5-Math, a family of large language models fine-tuned for mathematical reasoning. They tested ground-truth rewards, majority-vote rewards, format rewards based on boxed expressions, random rewards, and incorrect rewards. Remarkably, they observed that even completely spurious signals, such as random rewards and rewards for wrong answers, could lead to substantial performance gains on Qwen models. For example, training Qwen2.5-Math-7B on MATH-500 with ground-truth rewards yielded a 28.8% improvement, while incorrect labels resulted in a 24.6% gain. Random rewards still produced a 21.4% boost, and format rewards led to a 16.4% improvement. Majority-vote rewards provided a 26.5% accuracy gain. These improvements were not limited to a single model; Qwen2.5-Math-1.5B also showed strong gains: format rewards increased accuracy by 17.6% and incorrect labels by 24.4%. However, the same reward strategies failed to deliver similar benefits on other model families, such as Llama3 and OLMo2, which showed minimal or negative changes when trained with spurious rewards. For example, Llama3.1-8B saw performance drops of up to 8.5% under certain spurious signals, highlighting the model-specific nature of the observed improvements.

The research team's approach involved using RLVR training to fine-tune the models with these varied reward signals, replacing ground-truth supervision with heuristic or randomized feedback. They found that Qwen models, even without access to correct answers, could still learn to produce high-quality reasoning outputs. A key insight was that Qwen models tended to exhibit a distinct behavior called "code reasoning," generating structured mathematical solutions as code, particularly in Python-like formats, regardless of whether the reward signal was meaningful. This tendency toward code reasoning became more frequent over training, rising from 66.7% to more than 90% of responses in Qwen2.5-Math-7B when trained with spurious rewards. Responses that included code reasoning showed higher accuracy, often around 64%, compared with only 29% for responses without these reasoning patterns. These patterns emerged consistently, suggesting that spurious rewards can unlock latent capabilities learned during pre-training rather than introducing new reasoning skills.
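As a rough illustration of how this "code reasoning" tendency might be measured, the snippet below flags responses containing Python-style markers and computes the fraction that use code; this heuristic is an assumption made for illustration and may differ from the detection criterion actually used in the study.

```python
def contains_code_reasoning(response: str) -> bool:
    """Heuristically flag responses that reason via Python-like code.

    Illustrative check only: looks for common Python markers such as code
    fences, function definitions, imports, or print calls.
    """
    markers = ("```python", "def ", "import ", "print(")
    return any(marker in response for marker in markers)

responses = [
    "Let x = 3. Then 2x + 1 = 7, so the answer is \\boxed{7}.",
    "```python\nx = 3\nprint(2 * x + 1)\n```\nThe answer is \\boxed{7}.",
]
code_fraction = sum(contains_code_reasoning(r) for r in responses) / len(responses)
print(f"Fraction of responses using code reasoning: {code_fraction:.2f}")  # 0.50
```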

The performance data underscored the surprising robustness of Qwen models. Gains from random rewards (21.4% on MATH-500) and incorrect labels (24.6%) nearly matched the 28.8% gain from ground-truth rewards. Similar trends appeared across other benchmarks, such as AMC, where format, incorrect, and random rewards each produced improvements of around 18%, only slightly below the roughly 25% improvement from ground-truth or majority-vote rewards. Even on AIME2024, spurious rewards such as format (+13.0%), incorrect (+8.7%), and random (+6.3%) led to meaningful gains, although the advantage of ground-truth labels (+12.8%) remained evident, particularly on AIME2025 questions created after the model's pre-training cutoff.

Several key takeaways from the research include:

  • Qwen2.5-Math-7B gained 28.8% accuracy on MATH-500 with ground-truth rewards, but also 24.6% with incorrect rewards, 21.4% with random rewards, 16.4% with format rewards, and 26.5% with majority-vote rewards.
  • Code reasoning patterns emerged in Qwen models, rising from 66.7% to over 90% of responses under RLVR, and boosted accuracy from 29% to 64%.
  • Non-Qwen models, such as Llama3 and OLMo2, did not show similar improvements, with Llama3.1-8B suffering performance drops of up to 8.5% under spurious rewards.
  • Gains from spurious signals often appeared within the first 50 training steps, suggesting rapid elicitation of latent reasoning abilities.
  • The researchers warn that RLVR studies should avoid generalizing from results on Qwen models, because the effectiveness of spurious rewards is not universal.

In conclusion, these findings suggest that while Qwen models can leverage spurious signals to improve performance, the same is not true for other model families. Non-Qwen models such as Llama3 and OLMo2 showed flat or negative performance changes when trained with spurious signals. The research highlights the importance of validating RLVR methods on diverse models rather than relying solely on Qwen-centric results, as many recent papers have done.


Check out the Paper, official release, and GitHub page. All credit for this research goes to the researchers of this project.


