Reward models are fundamental components for aligning LLMs with human feedback, yet they face the challenge of reward hacking. These models tend to latch onto superficial attributes such as response length or formatting rather than identifying true quality indicators like factuality and relevance. The problem arises because standard training objectives fail to distinguish spurious correlations present in the training data from the genuine causal drivers of response quality. This failure to separate the two leads to brittle reward models (RMs) that produce misaligned policies. Hence, there is a need for a method that uses a causal understanding of preference formation to train RMs that are sensitive to causal quality attributes and invariant to diverse spurious cues.
Limits of existing RM approaches and the need for causal robustness
Existing methods attempt to tackle reward hacking in standard RLHF systems that rely on Bradley-Terry or pairwise ranking objectives. These include architectural modifications such as ODIN, policy-level adjustments, and data-centric methods involving ensembles or consistency checks. Recent causally inspired methods use MMD regularization against pre-specified spurious factors or estimate causal effects through corrected rewrites. However, these approaches target only predetermined spurious factors and miss unknown correlates. Augmentation strategies remain coarse, and evaluation-focused methods fail to equip reward models with training mechanisms that are robust against diverse spurious variations.
Introducing CROME: A causally robust reward model for LLMs
Researchers from Google DeepMind, McGill University, and the Mila – Quebec AI Institute propose CROME (Causally Robust Reward Modeling), a framework built on an explicit causal model of answer generation. CROME trains RMs to differentiate genuine quality drivers from surface cues by augmenting preference datasets with targeted, LLM-generated counterfactual examples. It creates two types of synthetic training pairs: (a) causal augmentations, which introduce changes along specific causal attributes, such as factuality, to enforce sensitivity to true quality shifts, and (b) neutral augmentations, which enforce invariance along spurious attributes such as style by using tie labels. CROME improves robustness, increasing RewardBench accuracy by up to 4.5% and boosting safety and reasoning.
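To make the two augmentation types concrete, here is a minimal Python sketch of how such training records could be represented. The field names and the `degrade_causal_attribute` / `rewrite_spurious_attribute` helpers are hypothetical stand-ins for the LLM-generated counterfactual rewrites described above, not the paper's actual data format.

```python
# A minimal sketch of the two augmentation types as training records.
# The rewrite helpers passed in are hypothetical stand-ins for LLM-generated
# counterfactuals (e.g., produced with Gemini 2.0 Flash).
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    tie: bool = False  # True => tie label: neither response is preferred

def make_causal_pair(prompt, response, degrade_causal_attribute):
    """Causal augmentation: degrade a genuine quality attribute (e.g., factuality)
    so the original response is preferred; this enforces sensitivity."""
    corrupted = degrade_causal_attribute(prompt, response)  # e.g., inject a factual error
    return PreferencePair(prompt, chosen=response, rejected=corrupted)

def make_neutral_pair(prompt, response, rewrite_spurious_attribute):
    """Neutral augmentation: rewrite only a spurious attribute (e.g., style or length)
    and assign a tie label; this enforces invariance."""
    restyled = rewrite_spurious_attribute(prompt, response)  # content preserved, style changed
    return PreferencePair(prompt, chosen=response, rejected=restyled, tie=True)
```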
Technical approach: Counterfactual augmentation and composite loss optimization
CROME operates in two main phases: generating attribute-aware counterfactual data based on a causal model, and training the reward model with a specialized loss on the combined data. The paper provides a theoretical analysis of how causal augmentation isolates true reward drivers from spurious correlates under an idealized model. CROME uses the UltraFeedback dataset, with counterfactuals generated by Gemini 2.0 Flash, and evaluates performance on RewardBench and reWordBench. The researchers use diverse base LLMs in their experiments, including Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B, for both Pairwise Preference and Bradley-Terry reward models, and assess downstream alignment impact through Best-of-N selection on multiple tasks.
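The exact composite objective is specified in the paper; the snippet below is only a plausible PyTorch sketch of the general idea, assuming a scalar reward model: a standard Bradley-Terry term on original and causally augmented pairs, plus a term that pulls the rewards of tie-labeled neutral pairs together to enforce invariance.

```python
# Sketch of a composite training objective combining preference and tie terms.
# The batch is assumed to already contain reward-model scores for each pair;
# the loss form and weighting are illustrative assumptions, not CROME's exact loss.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard pairwise preference loss: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def tie_loss(r_a: torch.Tensor, r_b: torch.Tensor) -> torch.Tensor:
    # Invariance term: rewards of a tie-labeled (neutral) pair should match,
    # so penalize the squared reward gap.
    return (r_a - r_b).pow(2).mean()

def composite_loss(batch: dict, neutral_weight: float = 1.0) -> torch.Tensor:
    # Preference term over original + causally augmented pairs,
    # plus a weighted invariance term over neutrally augmented pairs.
    pref = bradley_terry_loss(batch["pref_chosen_rewards"], batch["pref_rejected_rewards"])
    neut = tie_loss(batch["tie_a_rewards"], batch["tie_b_rewards"])
    return pref + neutral_weight * neut
```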
Performance gains: From RewardBench to WildGuardTest
On RewardBench, CROME improves ranking accuracy over RRM across diverse base models, with notable gains in Safety (up to 13.18%) and Reasoning (up to 7.19%). CROME achieves aggregate accuracy gains of up to 9.1% on reWordBench with Gemma-2-9B-IT in the pairwise preference setting, and superior performance on 21 of 23 transformations. It also exhibits a smaller drop in ranking accuracy from RewardBench to reWordBench compared to RRM (19.78% versus 21.54%). On WildGuardTest with Best-of-N selection, CROME shows strong safety improvements, achieving lower attack success ratios on harmful prompts while maintaining similar refusal rates on benign prompts.
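For reference, Best-of-N selection itself is straightforward; the sketch below assumes hypothetical `generate` and `score` callables standing in for a policy LLM's sampler and the trained reward model's scoring function.

```python
# Minimal sketch of Best-of-N selection with a reward model.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str, int], List[str]],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n candidate responses and return the one the reward model ranks highest."""
    candidates = generate(prompt, n)
    return max(candidates, key=lambda response: score(prompt, response))
```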
Conclusion and future directions in causal data augmentation
In conclusion, the researchers introduced CROME, a causal framework that addresses reward hacking during RM training. It employs two targeted synthetic data augmentation strategies: causal augmentations and neutral augmentations. CROME outperforms strong baselines across multiple base models and reward modeling techniques on RewardBench, and shows superior robustness on reWordBench against spurious correlations. This dataset-curation-centric training method (i.e., CROME) opens new research directions in synthetic data generation for base model training, where causal attribute verification could prove highly beneficial for future developments in robust language model alignment.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
