RL^V: Unifying Reasoning and Verification in Language Models through Value-Free Reinforcement Learning

by Brenden Burgess


LLMs have acquired exceptional reasoning capabilities through reinforcement learning (RL) on accuracy rewards. Modern RL algorithms for LLMs, including GRPO, VinePPO, and Leave-one-out PPO, have moved away from the traditional PPO approach by eliminating the learned value function network in favor of empirically estimated returns. This reduces compute requirements and GPU memory consumption, making RL training more feasible for increasingly large models. However, this efficiency comes with a trade-off: the value function could serve as a powerful outcome verifier for assessing the correctness of reasoning chains. Without this component, LLMs lose a valuable verification capability that could improve inference through parallel search strategies such as Best-of-N or weighted majority voting.
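To make the value-free idea concrete, here is a minimal sketch (ours, not the paper's code) of the group-relative advantage estimate that GRPO-style methods use in place of a learned value network: each sampled completion's reward is normalized against the other completions drawn for the same prompt.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage estimate: normalize each sampled completion's
    reward against the mean and std of its own group, so no learned value
    network (critic) is required."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 8 completions sampled for one prompt, scored with a 0/1 accuracy reward.
rewards = np.array([1, 0, 0, 1, 1, 0, 0, 0], dtype=np.float32)
print(group_relative_advantages(rewards))  # positive for correct answers, negative otherwise
```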

Recent progress in LLM reasoning has explored a variety of RL techniques, with traditional PPO algorithms demonstrating the usefulness of the value model as a test-time search verifier. However, the growing trend toward "value-free" RL methods (GRPO, VinePPO, Leave-one-out PPO) eliminates this capability, while training a separate verifier model imposes additional costs. Test-time verification approaches are alternatives for improving reasoning through compute scaling, including verifier models trained via binary classification, preference learning, or next-token prediction techniques. But these models require large training datasets, additional compute resources, and considerable GPU memory during inference.
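For contrast, a conventional stand-alone outcome verifier of the kind referenced above is typically a separate scoring model trained with a binary cross-entropy objective on correctness labels. The sketch below is an illustration of that generic recipe, not any specific paper's implementation, and shows why it implies an extra model to train and host at inference time.

```python
import torch
import torch.nn.functional as F

def outcome_verifier_loss(logits: torch.Tensor, is_correct: torch.Tensor) -> torch.Tensor:
    """Binary-classification verifier objective: a *separate* scoring model emits
    one logit per (problem, solution) pair and is trained against 0/1 correctness
    labels -- the extra model that RL^V aims to fold back into the reasoner itself."""
    return F.binary_cross_entropy_with_logits(logits, is_correct.float())

# Hypothetical batch of 4 candidate solutions, 2 labeled correct.
logits = torch.randn(4)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = outcome_verifier_loss(logits, labels)
```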

Researchers from McGill University, the University of Montreal, Microsoft Research, and Google DeepMind have proposed RL^V to recover the potential of value-like signals in RL for LLMs. RL^V augments "value-free" methods with a generative verifier without compromising training scalability. RL^V leverages the LLM's generation capabilities, using the abundant data produced during RL training to optimize the model as both a reasoner and a verifier. This dual-function approach frames verification as a next-token prediction task, allowing the same LLM to generate solutions while providing an intrinsic verification score. Initial results show RL^V boosting MATH accuracy by more than 20% over base RL methods when using parallel sampling, and achieving 8× to 32× more efficient test-time compute scaling.
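The following sketch illustrates the idea of generative self-verification with Hugging Face Transformers. The verification prompt and the use of a "Yes"/"No" token probability as the score are our illustrative assumptions, not the paper's exact format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: the prompt wording below is an assumption, not the paper's template.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-1.5B")

def self_verification_score(problem: str, solution: str) -> float:
    """Score a candidate solution with the *same* LLM that generated it, by reading
    off the next-token probability of 'Yes' vs. 'No' after a verification prompt."""
    prompt = (
        f"{problem}\n\nProposed solution:\n{solution}\n\n"
        "Is this solution correct? Answer Yes or No:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()
```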

RL^V unifies a reasoner and a generative verifier within a single LLM, addressing four key research questions on parallel test-time scaling, verifier training methodologies, test-time usage strategies, and interactions with sequential scaling in thinking models. The setup uses the Hendrycks MATH dataset for RL training, running on 4× NVIDIA A100 80GB GPUs for 3 hours, with evaluations reported on the MATH500, MATH², GPQA, and AIME'24 benchmarks. The researchers use the Qwen2.5-Math-1.5B model, fine-tuning it with the GRPO, Leave-one-out PPO, and VinePPO algorithms, with and without unified verification, for the shorter CoT experiments. Training used a 1024-token context window, with inference generating up to 1024 tokens for MATH500 and 2048 tokens for the other test sets.
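Schematically, and in our own notation rather than the paper's, "with unified verification" means the value-free RL objective is augmented with a verification term computed on the solutions produced during RL training, weighted by the coefficient λ discussed in the next paragraph:

```python
def rl_v_objective(rl_loss: float, verification_nll: float, lam: float = 0.1) -> float:
    """Sketch of the joint objective (our paraphrase): the usual value-free RL loss
    plus lambda times the next-token verification loss, i.e. the negative
    log-likelihood of the correct "Yes"/"No" judgment for each sampled solution.
    lam = 0.1 is a placeholder, not a value reported by the authors."""
    return rl_loss + lam * verification_nll
```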

RL^V shows strong test-time scaling capabilities, achieving up to 32× greater efficiency and 4% higher accuracy than baseline methods on MATH500 with 512 samples. Tests of optimal verification strategies reveal that weighted voting outperforms majority voting and Best-of-N when sampling more than 8 solutions per problem, for both short and long CoT models. RL^V is also complementary to sequential inference compute scaling, with the GRPO^V method achieving the highest success rates on AIME'24 at longer generation lengths. Training the unified verifier requires careful balancing via the verification coefficient λ, which presents a significant trade-off in the GRPO^V implementation: increasing λ improves verifier accuracy (from ~50% to ~80%), at some cost to the reasoner's own performance.
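Below is a minimal sketch of the three aggregation strategies compared above, assuming each sampled solution has already been reduced to a final answer string and an intrinsic verifier score in [0, 1]; the function and variable names are ours.

```python
from collections import defaultdict

def aggregate(answers: list[str], scores: list[float], strategy: str = "weighted") -> str:
    """Parallel test-time aggregation over N sampled solutions.
    - 'majority':  pick the most frequent final answer.
    - 'best_of_n': pick the answer of the single highest-scoring solution.
    - 'weighted':  sum verifier scores per answer and take the argmax
                   (the strategy found best beyond ~8 samples per problem)."""
    if strategy == "best_of_n":
        return answers[max(range(len(answers)), key=lambda i: scores[i])]
    weights = defaultdict(float)
    for ans, score in zip(answers, scores):
        weights[ans] += score if strategy == "weighted" else 1.0
    return max(weights, key=weights.get)

# Example with 5 sampled solutions and their self-verification scores.
answers = ["42", "41", "42", "7", "42"]
scores = [0.9, 0.2, 0.8, 0.1, 0.7]
print(aggregate(answers, scores, "weighted"))  # -> "42"
```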

In this article, the researchers introduced RL^V, which incorporates verification into "value-free" RL frameworks without significant overhead and shows improvements in reasoning accuracy, test-time compute efficiency, and cross-domain generalization across math datasets, GPQA, and AIME'24. Future research directions could explore enhancing the generative verifier to produce explicit CoT explanations, although this would require verification-specific CoT data or dedicated RL training processes. The unified framework for solution generation and verification via RL establishes a valuable foundation for continued progress in LLM reasoning capabilities.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90K+ ML SubReddit.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
