Large language models are now used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has given rise to "LLM-as-a-Judge," where models evaluate the outputs of other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, they simulate a thoughtful evaluation, making them better suited to complex tasks such as mathematical problem solving, ethical reasoning, and interpreting user intent. Their ability to interpret and validate responses across languages and domains improves automation and scalability in language model development.
However, current AI judgment systems face problems of inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for assessing subjective or open-ended prompts. A common issue is position bias, where the order of the candidate responses affects the final decision, compromising fairness. In addition, collecting human-annotated data at scale is expensive and time-consuming, which limits how well these models generalize.
Several existing approaches have tackled these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limits adaptability across task types. Others depend on distillation from large models such as DeepSeek-R1 but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hamper dynamic reasoning, while more recent methods using score formatting or structured prompting have shown only minimal accuracy gains. Despite larger datasets and models, performance in traditional systems has stalled.
Researchers from Meta's GenAI and FAIR teams introduced J1 to address these limitations. J1 trains judgment models within a reinforcement-learning framework, enabling them to learn from verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise comparisons. This synthetic dataset comprised 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the base models Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct. The models were trained using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that eliminates the need for critic models and accelerates convergence.
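To make the verifiable-reward idea concrete, here is a minimal Python sketch of how a synthetic preference pair can act as a training signal without human annotation. The `PreferencePair` structure and `verifiable_reward` function are illustrative stand-ins, not code from the J1 paper.

```python
# Minimal sketch of the verifiable-reward idea behind J1's synthetic preference data.
# The names below (PreferencePair, verifiable_reward) are hypothetical stand-ins.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # known high-quality response
    rejected: str  # synthetically degraded, lower-quality response

def verifiable_reward(pair: PreferencePair, judge_verdict: str) -> float:
    """Reward the judge only when it prefers the known-good response.

    Because the pair is constructed so that `chosen` is better by design,
    the preference label is verifiable without human annotation.
    """
    return 1.0 if judge_verdict == "chosen" else 0.0

# Example: one WildChat-style prompt turned into a verifiable training pair.
pair = PreferencePair(
    prompt="Explain why the sky is blue.",
    chosen="Sunlight scatters off air molecules; shorter (blue) wavelengths scatter most.",
    rejected="The sky is blue because the ocean reflects onto it.",
)
print(verifiable_reward(pair, judge_verdict="chosen"))  # 1.0
```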

At the heart of the training strategy is position-agnostic learning, where both input orderings (x, A, B) and (x, B, A) are used during training to prevent position bias. In addition, consistency-based rewards are applied only when the model delivers correct verdicts across both orderings of the responses. This structure keeps the judge fair and reliable regardless of prompt or response order. The training framework supports several variations: models can produce final verdicts, numerical scores for each response, or both. A pointwise judgment variant is also included, which evaluates single responses on a 0-to-10 scale. These formats make J1 a versatile, generalizable system capable of judging a wide range of tasks.
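The consistency requirement can be sketched as follows. The `judge` callable and the "first"/"second" verdict labels are hypothetical placeholders; the point is that reward is granted only when both orderings are judged correctly, so relying on position earns nothing.

```python
# Sketch of a position-consistency reward: the judge sees the same pair in both
# orders and only earns reward when both verdicts pick the truly better response.
# `judge` is a hypothetical callable standing in for the trained model; it takes
# (prompt, response_1, response_2) and returns "first" or "second".
from typing import Callable

def consistency_reward(
    judge: Callable[[str, str, str], str],
    prompt: str,
    good: str,
    bad: str,
) -> float:
    # Order 1: good response shown first -> correct verdict is "first"
    v1 = judge(prompt, good, bad)
    # Order 2: good response shown second -> correct verdict is "second"
    v2 = judge(prompt, bad, good)
    # Reward only when the judge is right under BOTH orderings,
    # which removes any incentive to rely on position.
    return 1.0 if (v1 == "first" and v2 == "second") else 0.0

# Toy judge that always picks the first response: it earns no reward,
# illustrating how position bias is penalized.
always_first = lambda prompt, a, b: "first"
print(consistency_reward(always_first, "Q?", "good answer", "bad answer"))  # 0.0
```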

The results obtained with the J1 models show substantial performance gains over existing systems. On the Preference Proxy Evaluations (PPE) benchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with more than ten times as much data. By comparison, models such as DeepSeek-GRM-27B and EvalPlanner-Llama-70B reached 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% against 55.5%. J1 also performed strongly on other key benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are significant rather than marginal, considering the limited training data used for J1 compared with the expansive datasets behind other models.

Several key takeaways from the research on J1:
- J1 is trained on 22,000 synthetic preference pairs, comprising 17K from WildChat and 5K from mathematical tasks.
- Training uses GRPO, which streamlines RL by avoiding the need for separate critic models.
- It introduces position-agnostic learning, reducing position bias through consistency-based rewards.
- Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data yet outperformed models trained at much larger scale.
- J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27B (67.2%) and EvalPlanner-Llama-70B (65.6%).
- Supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores (illustrated in the sketch after this list).
- Outperforms models distilled from DeepSeek-R1 and OpenAI's o1-mini on several tasks.
- Demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments.
- J1's framework makes it a generalist judge applicable to both verifiable and non-verifiable tasks.
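For illustration, the three judgment formats could look roughly like the hypothetical prompt templates below. These are sketches of the format idea only, not the exact prompts used to train J1.

```python
# Hypothetical prompt templates illustrating the three judgment formats listed
# above (pairwise verdict, pairwise scores, pointwise 0-10 score). Illustrative
# only; not the prompts from the J1 paper.

PAIRWISE_VERDICT = """You are a judge. Think step by step, then answer.
Question: {question}
Response A: {response_a}
Response B: {response_b}
Final verdict (A or B):"""

PAIRWISE_SCORES = """You are a judge. Think step by step, then score each response from 0 to 10.
Question: {question}
Response A: {response_a}
Response B: {response_b}
Score A: <0-10>
Score B: <0-10>"""

POINTWISE_SCORE = """You are a judge. Think step by step, then score the response from 0 to 10.
Question: {question}
Response: {response}
Score: <0-10>"""

# Example: fill the pairwise-verdict template for a toy question.
print(PAIRWISE_VERDICT.format(
    question="What is 2 + 2?",
    response_a="4",
    response_b="5",
))
```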
In conclusion, the J1 approach fundamentally redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning remove the traditional need for costly annotation while promoting fair, logical, and consistent evaluations. The work shows that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also validates the idea that judgment models should be thinkers first and scorers second. With performance that often rivals or exceeds state-of-the-art systems, J1 sets a new benchmark for training LLM-as-a-Judge systems.
