Policy gradient methods have considerably advanced LLM reasoning capabilities, in particular via reinforcement learning (RL). A key tool for stabilizing these methods is Kullback-Leibler (KL) regularization, which discourages drastic changes between the current policy and a reference policy. Although widely used in algorithms such as PPO, there is still much to explore in how different KL variants, such as forward KL, reverse KL, and their unnormalized forms, can be estimated and incorporated into loss functions. These choices, together with different gradient estimators and on-policy versus off-policy settings, shape training stability and performance in nuanced and under-explored ways.
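For concreteness, the two divergence directions discussed here can be written as below; note that the naming convention (which argument order counts as "forward") is an assumption for illustration, since different papers order the arguments differently.

```latex
% Forward and reverse KL between the current policy \pi_\theta and a fixed
% reference policy \pi_{\mathrm{ref}} (naming convention assumed for illustration).
\mathrm{KL}_{\mathrm{fwd}}\bigl(\pi_{\mathrm{ref}} \,\|\, \pi_\theta\bigr)
  = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}}\!\left[\log \frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_\theta(y \mid x)}\right],
\qquad
\mathrm{KL}_{\mathrm{rev}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right].
```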
Fine-tuning LLMs with human feedback is crucial for building aligned AI systems. Two main strategies are used: optimizing against reward models with policy gradient methods such as PPO, and training directly on human preferences with methods such as Direct Preference Optimization (DPO). While PPO stabilizes training with reward models, DPO and its variants use pairwise comparisons to simplify learning and have gained popularity in recent models. Reinforcement learning is also increasingly used to improve LLM reasoning, especially on complex tasks such as mathematics and coding. Newer methods aim to reduce computational cost and improve training stability, often by replacing value networks or modifying KL penalties.
Researchers from UCLA, Tsinghua University, and the Shanghai Qi Zhi Institute introduce Regularized Policy Gradient (RPG), a unified framework for KL-regularized policy gradients in online reinforcement learning. They derive policy gradients and surrogate loss functions using both forward and reverse KL divergences, covering normalized and unnormalized policies. RPG supports both fully differentiable objectives and REINFORCE-style estimators, adapted to off-policy training with importance sampling. The study also identifies and addresses theoretical issues in existing methods, such as GRPO, and examines KL regularization in REINFORCE++. Experiments on LLM reasoning tasks demonstrate that RPG achieves improved stability and performance compared with baselines, including GRPO, REINFORCE++, and DAPO.
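As a rough illustration of the off-policy ingredient, the sketch below computes sequence-level importance weights from log-probabilities stored under an older sampling policy; the tensor names and the choice to sum per-token log-probabilities over the response are assumptions for this example, not the paper's exact formulation.

```python
import torch

def importance_weights(logp_current: torch.Tensor,
                       logp_old: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
    """Sequence-level importance weights pi_theta(y|x) / pi_old(y|x).

    logp_current, logp_old: per-token log-probabilities, shape (batch, seq_len).
    mask: 1.0 for response tokens, 0.0 for prompt/padding tokens.
    """
    # Sum per-token log-probs over the response to get sequence log-probs,
    # then exponentiate the difference to obtain the likelihood ratio.
    log_ratio = ((logp_current - logp_old) * mask).sum(dim=-1)
    return torch.exp(log_ratio)
```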
The study presents policy gradient methods that integrate KL-divergence regularization into online, off-policy settings using importance sampling from an older policy. For the forward KL, the gradient involves importance-weighted rewards plus a regularization term, and its loss resembles a maximum-likelihood loss when rewards are zero. The unnormalized forward KL adds a correction for mismatched distribution masses. Similarly, the reverse KL and its unnormalized form penalize deviation from the reference policy, modifying the reward through log-ratio terms. All approaches share a REINFORCE-like gradient structure, allowing alternative implementations with the stop-gradient operator, which supports stable and efficient optimization in practice.
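The sketch below illustrates this structure for the reverse-KL case only: the per-sequence reward is shifted by a log-ratio penalty against the reference policy, the importance weight against the sampling policy is detached (the stop-gradient trick mentioned above), and the result is fed through a REINFORCE-style surrogate. The coefficient name beta and the exact placement of the detach are assumptions for illustration, not the paper's precise loss.

```python
import torch

def reverse_kl_reinforce_loss(logp_current, logp_old, logp_ref, reward, mask, beta=0.1):
    """REINFORCE-style surrogate with a reverse-KL penalty (illustrative sketch).

    All log-prob tensors have shape (batch, seq_len); reward has shape (batch,).
    """
    seq_logp = (logp_current * mask).sum(-1)       # log pi_theta(y|x)
    seq_logp_old = (logp_old * mask).sum(-1)       # log pi_old(y|x)
    seq_logp_ref = (logp_ref * mask).sum(-1)       # log pi_ref(y|x)

    # Reverse-KL regularization enters as a log-ratio correction to the reward,
    # treated as a constant via detach (stop-gradient).
    shaped_reward = reward - beta * (seq_logp - seq_logp_ref).detach()

    # Importance weight against the older sampling policy, also detached so that
    # only the REINFORCE term below carries gradients.
    weight = torch.exp(seq_logp - seq_logp_old).detach()

    # REINFORCE-style surrogate: the gradient flows through seq_logp only.
    loss = -(weight * shaped_reward * seq_logp).mean()
    return loss
```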
The researchers carried out an in-depth evaluation of their proposed RPG methods, both the differentiable and the REINFORCE-style variants, by comparing them with several established baselines on complex mathematical reasoning tasks using Qwen2.5 language models. They trained on the DAPO-Math-17k dataset and evaluated performance on benchmarks such as AMC23 and AIME. RPG variants consistently demonstrated strong accuracy, training stability, and efficient memory use. The implementation relied on the verl framework and techniques such as KL regularization, PPO-style clipping, and schedule-free AdamW for smoother optimization. RPG models generally outperformed the others in reward shaping, entropy control, and response length, highlighting their robustness and suitability for stable, high-performance learning.
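For readers unfamiliar with the implementation details mentioned above, here is a minimal, generic sketch of how PPO-style clipping and a KL penalty are typically combined in such training loops; the clip range, the simple exponential KL estimate, and the function signature are assumptions for illustration, not the exact verl or RPG configuration.

```python
import torch

def clipped_kl_loss(logp_current, logp_old, logp_ref, advantage, mask,
                    clip_eps=0.2, kl_coef=0.01):
    """PPO-style clipped surrogate plus a per-token KL penalty (generic sketch)."""
    ratio = torch.exp(logp_current - logp_old)                          # per-token ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    policy_loss = -torch.minimum(unclipped, clipped)

    # Simple per-token estimate of KL(pi_theta || pi_ref) using
    # exp(d) - d - 1 with d = logp_ref - logp_current (always non-negative).
    d = logp_ref - logp_current
    kl_penalty = torch.exp(d) - d - 1.0

    per_token = (policy_loss + kl_coef * kl_penalty) * mask
    return per_token.sum() / mask.sum()
```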
In conclusion, RPG is a comprehensive framework for designing and analyzing policy gradient methods that integrate KL regularization into online, off-policy reinforcement learning. It explores a range of configurations, including forward and reverse KL divergences, normalized and unnormalized policy distributions, and two types of estimators: fully differentiable and REINFORCE-style. RPG aims to provide a structured approach to understanding and implementing these variations. Applied to reasoning tasks with large language models, the proposed methods demonstrate more stable training and competitive or improved performance compared with established baselines such as GRPO, REINFORCE++, and DAPO.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95K+ ML SubReddit and subscribe to our newsletter.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, Sana brings a fresh perspective to the intersection of AI and real-life solutions.
