SynPref-40M and Skywork-Reward-V2: evolving human preference alignment for advanced reward models

by Brenden Burgess


Understanding the limits of current reward models

Although reward models play a crucial role in reinforcement learning from human feedback (RLHF), many of today's open models still struggle to reflect the full range of complex human preferences. Even with sophisticated training techniques, meaningful progress has been limited. A major reason appears to be shortcomings in current preference datasets, which are often too narrow, synthetically generated, or poorly verified. While rule-based systems are effective for clear-cut tasks such as mathematics or coding, they generally fail to capture nuanced human judgment. In addition, common benchmarks like RewardBench are becoming less reliable indicators of real-world performance, showing poor correlation with downstream task success.

Preference data creation challenges and new approaches

Creating high-quality preference data has traditionally relied on human annotators, but this approach is time-consuming, expensive, and sometimes inconsistent. To address this, recent techniques such as RLAIF use LLMs to automate annotation, sometimes even outperforming humans. Newer approaches aim to combine the strengths of both by integrating LLM-generated data with human-verified labels. Meanwhile, reward models have evolved from simple scoring schemes, such as the Bradley-Terry model, to more complex frameworks, including generative and direct-optimization methods. Despite the availability of many robust open models and datasets, challenges persist in accurately capturing nuanced human preferences across diverse tasks and languages.
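For readers who want to see what the Bradley-Terry formulation looks like in practice, here is a minimal sketch (not the authors' code) of the standard pairwise loss it induces for reward-model training, assuming each training example yields scalar rewards for a chosen and a rejected response:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: the model should score the chosen response higher.

    P(chosen > rejected) = sigmoid(r_chosen - r_rejected); we minimize its
    negative log-likelihood, averaged over the batch.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up scalar rewards for a batch of three preference pairs.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.4, 0.9, -0.5]))
print(loss.item())
```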

Introducing SynPref-40M: a large-scale human preference dataset

Researchers from 2050 Research and Skywork AI introduce SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through strict verification, while LLMs scale up data curation under human guidance. From this dataset, the team develops Skywork-Reward-V2, a family of eight reward models (0.6B-8B parameters) trained on a high-quality subset of 26 million pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The study emphasizes that success comes not just from data volume, but from careful, iterative curation that blends human expertise with AI scalability.
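To illustrate how a sequence-classifier reward model of this kind is typically queried, the sketch below scores a single conversation with Hugging Face Transformers. The repository id is an assumption based on the naming used in the paper; see the Hugging Face link at the end of the article for the actual checkpoints.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed repository id; substitute the actual Skywork-Reward-V2 checkpoint name.
model_name = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, num_labels=1
)
model.eval()

conversation = [
    {"role": "user", "content": "Explain what a reward model does in RLHF."},
    {"role": "assistant", "content": "A reward model scores candidate responses so the policy can be optimized toward preferred behavior."},
]

# Chat-style reward models are usually fed the conversation rendered with the chat template.
input_ids = tokenizer.apply_chat_template(conversation, tokenize=True, return_tensors="pt")
with torch.no_grad():
    score = model(input_ids).logits[0][0].item()  # higher = more preferred
print(score)
```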

Two-stage human-AI curation pipeline

Current open reward models often suffer from overfitting to narrow benchmarks such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduce a two-stage human-AI pipeline for curating preference data at scale. Stage 1 begins with human-verified annotations that guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. Stage 2 scales the process using consistency checks between the current best model and a human-trained "gold" reward model, filtering reliable samples without further human input, as sketched below. This approach strikes a balance between quality and scalability, ultimately enabling the creation of tens of millions of high-quality preference pairs.
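To make the Stage 2 filtering concrete, here is a schematic sketch, using hypothetical helper names rather than the authors' implementation, of the consistency check: a candidate pair is kept only when the current reward model and the human-trained "gold" reward model both agree that the labeled "chosen" response is preferred.

```python
from typing import Callable, Dict, List

# A reward model is abstracted as a callable mapping (prompt, response) -> scalar score.
RewardFn = Callable[[str, str], float]

def prefers_chosen(rm: RewardFn, pair: Dict[str, str]) -> bool:
    """True if the model ranks the labeled 'chosen' response above the 'rejected' one."""
    return rm(pair["prompt"], pair["chosen"]) > rm(pair["prompt"], pair["rejected"])

def consistency_filter(pairs: List[Dict[str, str]],
                       current_rm: RewardFn,
                       gold_rm: RewardFn) -> List[Dict[str, str]]:
    """Keep only pairs on which both reward models agree with the preference label."""
    return [p for p in pairs if prefers_chosen(current_rm, p) and prefers_chosen(gold_rm, p)]

# Toy usage with a stand-in scorer (here, longer responses score higher).
toy_rm: RewardFn = lambda prompt, response: float(len(response))
pairs = [{"prompt": "Q?", "chosen": "A detailed answer.", "rejected": "No."}]
print(consistency_filter(pairs, toy_rm, toy_rm))
```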

Benchmarking Skywork-Reward-V2: compact but powerful models

The Skywork-Reward-V2 series demonstrates strong performance across several benchmarks, surpassing both much larger models (e.g., 70B parameters) and emerging generative reward models. Trained on Qwen3 (0.6B-8B) and Llama 3.1/3.2 (1B-8B) backbones, these models achieve high scores on RewardBench, PPE, RM-Bench, and JudgeBench, with the best-performing variant (Llama-3.1-8B-40M) surpassing all others with an average score of 88.6. Despite smaller model sizes, the Skywork-Reward-V2 models benefit from high-quality data (SynPref-40M) and efficient training configurations, allowing them to generalize better in real-world RLHF scenarios. Notably, even mid-sized models such as Qwen3-1.7B outperform some 70B models, underscoring the impact of data quality and methodology over raw parameter count.
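For context, much of what these benchmarks measure reduces to pairwise accuracy: the fraction of preference pairs on which the reward model ranks the labeled "chosen" response above the "rejected" one. A minimal, self-contained sketch of that metric (the real benchmarks add task subsets and other evaluation protocols):

```python
from typing import Callable, Dict, List

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar score

def pairwise_accuracy(pairs: List[Dict[str, str]], rm: RewardFn) -> float:
    """Fraction of preference pairs where the reward model scores 'chosen' above 'rejected'."""
    correct = sum(
        rm(p["prompt"], p["chosen"]) > rm(p["prompt"], p["rejected"]) for p in pairs
    )
    return correct / len(pairs)
```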


Conclusion and future outlook: scaling with precision

In conclusion, SynPref-40M is a large-scale preference dataset built through a two-stage human-AI collaboration, combining human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a series of eight reward models (0.6B-8B parameters) that outperform existing models on seven key benchmarks. These models show strong generalization in aligning with human values, ensuring accuracy, safety, and robustness to bias. Extensive ablation studies confirm that data quality and the curation method are the main drivers of performance. Looking ahead, the researchers aim to explore new training strategies, as reward models become central to LLM development and alignment.


Check out the Paper, Model on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, YouTube, and Spotify, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.


Sana Hassan

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, Sana brings a fresh perspective to the intersection of AI and real-life solutions.
