Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models (RRMs) to Dynamically Scale Test-Time Compute for Better Alignment

by Brenden Burgess


Reinforcement learning (RL) has become a foundational approach in LLM post-training, using supervision from human feedback (RLHF) or verifiable rewards (RLVR). Although RLVR shows promise in mathematical reasoning, it faces a significant constraint: it depends on training queries with verifiable answers. This requirement limits its application to large-scale training on general-domain queries, where verification is intractable. In addition, current reward models, which fall into scalar and generative types, cannot effectively scale test-time compute for reward estimation. Existing approaches apply uniform computational resources to all inputs, lacking the adaptability to allocate extra resources to difficult queries that require nuanced analysis.

Reward models are characterized by their scoring formulation and rating pattern. Scalar approaches assign a numeric score to a query-response pair, while generative methods produce natural-language feedback. Rating follows either absolute evaluation of individual pairs or discriminative comparison of candidate responses. Generative reward models, aligned with the LLM-as-a-Judge paradigm, offer interpretable feedback but face reliability problems due to biases. Test-time scaling methods dynamically adjust computational resources, including parallel strategies such as multi-sampling and horizon-based scaling for extended reasoning traces. However, they lack systematic adaptation to input complexity, limiting their effectiveness across diverse query types.

Researchers from Microsoft Research, Tsinghua University, and Peking University have proposed Reward Reasoning Models (RRMs), which perform explicit reasoning before producing final rewards. This reasoning phase allows RRMs to adaptively allocate additional computational resources when evaluating responses to complex tasks. RRMs introduce a new dimension for improving reward modeling by scaling test-time compute while maintaining general applicability across diverse evaluation scenarios. Through chain-of-thought reasoning, RRMs spend additional test-time compute on complex queries where the appropriate reward is not immediately apparent. This encourages RRMs to self-evolve reward-reasoning capabilities without explicit reasoning traces as training data.
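To make the "reason first, then judge" idea concrete, here is a minimal sketch of chain-of-thought reward judging with a Hugging Face causal LM. The model ID, prompt wording, and verdict format are placeholder assumptions for illustration, not the exact setup released with the paper.

```python
# Minimal sketch: generate a reasoning trace, then parse a final verdict.
# MODEL_ID and the prompt are placeholders, not the paper's released artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2-7B-Instruct"  # placeholder judge; the paper builds RRMs on a Qwen2 backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def judge(query: str, response_a: str, response_b: str, max_new_tokens: int = 1024) -> str:
    """Ask the judge to reason step by step, then emit 'Verdict: A' or 'Verdict: B'."""
    messages = [{
        "role": "user",
        "content": (
            f"Query:\n{query}\n\nResponse A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
            "Compare the two responses on instruction adherence, helpfulness, accuracy, "
            "harmlessness, and level of detail. Think step by step, then end with a line "
            "'Verdict: A' or 'Verdict: B'."
        ),
    }]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7)
    # The generated text is the reasoning trace; the verdict comes at the end.
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return "A" if "Verdict: A" in text else "B"
```

Longer reasoning budgets (larger `max_new_tokens`) correspond to spending more test-time compute on harder comparisons.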

RRMs use the Qwen2 model with a Transformer-decoder backbone, framing reward modeling as text completion: the RRM autoregressively generates a thinking process followed by a final judgment. Each input contains a query and two responses, and the model must determine which is preferred, with ties not allowed. The reward prompt guides systematic analysis across evaluation criteria, including instruction fidelity, helpfulness, accuracy, harmlessness, and level of detail. RRMs support multi-response evaluation via Elo rating systems and knockout tournaments, both of which can be combined with majority voting to make better use of test-time compute: each pairwise comparison is sampled multiple times, and a majority vote yields a robust comparison result.
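The following sketch shows how the multi-response strategies described above can be built on top of a pairwise comparator such as the hypothetical `judge` sketched earlier; the function names and defaults are illustrative assumptions, not the paper's implementation.

```python
# Sketch of majority voting, a knockout tournament, and round-robin Elo
# on top of an assumed pairwise judge: (query, response_a, response_b) -> "A" or "B".
import random
from typing import Callable, List

Judge = Callable[[str, str, str], str]

def majority_vote(judge: Judge, query: str, a: str, b: str, k: int = 5) -> str:
    """Sample k pairwise judgments and return the majority winner.
    Each extra sample spends more test-time compute for a more robust comparison."""
    votes_for_a = sum(1 for _ in range(k) if judge(query, a, b) == "A")
    return "A" if votes_for_a * 2 > k else "B"

def knockout_tournament(judge: Judge, query: str, candidates: List[str], k: int = 5) -> str:
    """Single-elimination tournament: pair up candidates, keep each round's winners,
    and repeat until one response remains."""
    pool = list(candidates)
    random.shuffle(pool)
    while len(pool) > 1:
        winners = []
        if len(pool) % 2 == 1:      # odd pool size: the last candidate gets a bye
            winners.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            winners.append(a if majority_vote(judge, query, a, b, k) == "A" else b)
        pool = winners
    return pool[0]

def elo_ratings(judge: Judge, query: str, candidates: List[str],
                k_factor: float = 32.0) -> List[float]:
    """Round-robin Elo: compare every pair once and update ratings from the
    observed outcome versus the expected score."""
    ratings = [1000.0] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
            score_i = 1.0 if judge(query, candidates[i], candidates[j]) == "A" else 0.0
            ratings[i] += k_factor * (score_i - expected_i)
            ratings[j] += k_factor * ((1.0 - score_i) - (1.0 - expected_i))
    return ratings
```

The knockout tournament needs roughly N-1 comparisons to pick a winner among N candidates, while round-robin Elo needs N(N-1)/2 but produces a full ranking; both can trade extra samples per comparison for robustness via `majority_vote`.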

Evaluation results show that RRMs achieve competitive performance against strong baselines on the RewardBench and PandaLM Test benchmarks, with RRM-32B reaching 98.6% accuracy in the reasoning categories. Comparison with DirectJudge models trained on identical data reveals substantial performance gaps, indicating that RRMs effectively use test-time compute on complex queries. In reward-guided best-of-N inference, RRMs outperform all baseline models even without additional test-time compute, and majority voting brings substantial further improvements across the evaluated subsets. Post-training experiments show steady gains in downstream performance on MMLU-Pro and GPQA. Scaling experiments across 7B, 14B, and 32B models confirm that longer thinking horizons consistently improve accuracy.
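As a usage-level illustration of reward-guided best-of-N inference (not the paper's exact evaluation pipeline), the tournament sketch above can select among candidates drawn from any policy model; `sample_response` is a hypothetical stand-in for that policy call.

```python
# Hypothetical best-of-N selection: sample N candidates, then let the
# reward model pick the winner via the knockout tournament sketched above.
def best_of_n(judge: Judge, sample_response: Callable[[str], str],
              query: str, n: int = 8, votes_per_match: int = 5) -> str:
    candidates = [sample_response(query) for _ in range(n)]
    return knockout_tournament(judge, query, candidates, k=votes_per_match)
```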

In conclusion, the researchers introduced RRMs, which perform explicit reasoning before assigning rewards, to address the computational inflexibility of existing reward modeling approaches. Rule-based reward RL enables RRMs to develop complex reasoning capabilities without requiring explicit reasoning traces as supervision. RRMs make effective use of test-time compute through both parallel and sequential scaling approaches. Their effectiveness in practical applications, including reward-guided best-of-N inference and post-training feedback, demonstrates their potential as a strong alternative to traditional scalar reward models in alignment techniques.


Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 95K+ ML SubReddit and subscribe to our Newsletter.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
