Can we improve the reasoning of Llama 3 through post-training alone? ASTRO shows +16% to +20% benchmark gains

by Brenden Burgess


Improving the reasoning capabilities of large language models (LLMs) without architectural changes is a fundamental challenge for advancing AI alignment and usability. Researchers from Meta AI and the University of Washington have introduced ASTRO, the "Autoregressive Search-Taught Reasoner", a new post-training framework designed to improve reasoning in Llama-3.1-70B-Instruct. ASTRO is unique in teaching models to perform in-context search, self-reflection, and backtracking, mechanisms often associated with human problem solving and traditional symbolic search algorithms. Through this approach, ASTRO boosts Llama 3's math performance on several competitive benchmarks with significant improvements:

  • MATH 500: 65.8% ➝ 81.8%
  • AMC 2023: 37.5% ➝ 64.4%
  • AIME 2024: 10.0% ➝ 30.0%

Search-guided chain-of-thought generation

ASTRO's methodology begins with a Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. This search explores both correct and incorrect reasoning paths. The key innovation is a procedure-cloning step: entire search trees are linearized into long chains of thought (CoT) that naturally encode failures and recovery via self-reflection and backtracking. These linearized traces are rewritten in natural language and used as the basis for supervised fine-tuning (SFT).

The result is a model that does not just solve problems step by step but re-evaluates its own trajectory, often backtracking after self-assessment to correct intermediate reasoning errors. For example, the model may interject with phrases like "Let's go back to where we set up the equation" when its internal confidence drops.
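To make the procedure-cloning idea concrete, below is a minimal sketch, under simplifying assumptions, of how a search tree could be linearized into one long CoT that keeps dead ends and marks recovery with explicit reflection and backtracking phrases. The node structure, traversal order, and wording are illustrative, not the paper's actual data format.

```python
# Minimal sketch (illustrative, not the paper's actual implementation) of
# "procedure cloning": linearizing an MCTS-style search tree into one long
# chain of thought that keeps dead ends and marks recovery with explicit
# self-reflection and backtracking phrases.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchNode:
    text: str                       # natural-language reasoning step
    is_correct: bool                # does this branch lead to a correct solution?
    children: List["SearchNode"] = field(default_factory=list)


def linearize(node: SearchNode, trace: List[str]) -> None:
    """Depth-first walk that emits every explored step, including failures."""
    trace.append(node.text)
    for child in node.children:
        linearize(child, trace)
        if not child.is_correct:
            # Failed branch: inject a reflection + backtrack marker before resuming.
            trace.append(
                "Wait, that approach doesn't work. Let's go back to where we "
                f"were before trying: {child.text}"
            )


if __name__ == "__main__":
    tree = SearchNode("Set up the equation from the problem statement.", True, [
        SearchNode("Try factoring the quadratic directly.", False),
        SearchNode("Apply the quadratic formula instead.", True, [
            SearchNode("Compute the discriminant and both roots.", True),
        ]),
    ])
    steps: List[str] = []
    linearize(tree, steps)
    print("\n".join(steps))  # one long CoT encoding failure, reflection, recovery
```

In ASTRO, traces of roughly this shape are then rewritten into fluent natural language before being used for supervised fine-tuning.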

Supervised fine-tuning: injecting search priors

ASTRO fine-tunes Llama-3.1-70B-Instruct on 36.1K curated CoT solutions drawn from MATH, AMC/AIME, and AoPS-style datasets. The model trained with ASTRO-SFT achieves:

  • MATH 500: 69.6%
  • AMC 2023: 51.9%
  • AIME 2024: 16.3%

These scores are competitive with or exceed those of baseline and SPOC/Step-KTO variants trained without explicit search priors. Importantly, even SFT alone, without reinforcement learning, boosts performance by exposing the model to search-structured reasoning data.
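As a rough illustration of this SFT stage, the sketch below fine-tunes a causal language model on (problem, linearized CoT) pairs with standard next-token prediction using Hugging Face transformers. The model name, data file, sequence length, and hyperparameters are placeholders rather than the paper's configuration, and a smaller checkpoint stands in for Llama-3.1-70B-Instruct.

```python
# A minimal SFT sketch using Hugging Face transformers/datasets. Paths, model
# name, and hyperparameters are placeholders, not the paper's configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # smaller stand-in for the 70B model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)


def to_features(example):
    # Concatenate the problem and its search-linearized CoT into one sequence.
    text = f"Problem: {example['problem']}\nSolution: {example['long_cot']}"
    return tokenizer(text, truncation=True, max_length=4096)


dataset = load_dataset("json", data_files="astro_sft_traces.jsonl")["train"]
dataset = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="astro-sft", num_train_epochs=1,
                           per_device_train_batch_size=1, bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```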


Reinforcement learning with search-aware initialization

ASTRO applies reinforcement learning (RL) by initializing from the SFT checkpoint and running an RL loop with a modified Group Relative Policy Optimization (GRPO). Unlike standard preference-based RL, ASTRO uses verifiable reward signals (+1 for correct, -1 for incorrect) on 8.7K moderately difficult prompts. During training, the model's CoT generation grows longer, from roughly 1.8K to 6K tokens, reflecting deeper internal exploration.
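A small sketch of the two ingredients just described, under simplifying assumptions: a verifiable +1/-1 reward checked against a reference answer, and the group-relative advantage that GRPO-style methods use in place of a learned value model. The answer-extraction logic and normalization details are placeholders, not the paper's implementation.

```python
# Sketch of a verifiable reward and GRPO-style group-relative advantages.
# The answer extraction and normalization are simplified placeholders.
import re
from statistics import mean, pstdev
from typing import List


def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """+1 if the final boxed answer (or last line) matches the reference, else -1."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    predicted = match.group(1) if match else model_output.strip().splitlines()[-1]
    return 1.0 if predicted.strip() == reference_answer.strip() else -1.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled solution's reward against its group mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled solutions for one prompt whose reference answer is "42".
    samples = ["... so \\boxed{42}", "... so \\boxed{41}", "... \\boxed{42}", "no idea"]
    rewards = [verifiable_reward(s, "42") for s in samples]
    print(rewards)                             # [1.0, -1.0, 1.0, -1.0]
    print(group_relative_advantages(rewards))  # positive for correct, negative otherwise
```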

The resulting ASTRO-RL model achieves:

  • MATH 500: 81.8%
  • AMC 2023: 64.4%
  • AIME 2024: 30.0%

These results rival or exceed those of models with larger parameter counts and confirm the importance of ASTRO's search-aware initialization.

Backtracking behavior correlates with reasoning success

A striking empirical observation is the positive correlation between backtracking frequency and performance. As training progresses, ASTRO-RL exhibits more self-corrective actions and deeper exploration. Pearson correlation coefficients across benchmarks exceed 0.8, indicating that self-reflection and backtracking are not merely cosmetic behaviors but functionally tied to better accuracy.
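The snippet below illustrates the kind of analysis this refers to: count explicit backtracking phrases in generated solutions and compute the Pearson correlation between that frequency and accuracy across training checkpoints. The marker phrases and all numbers are hypothetical placeholders, not the paper's data.

```python
# Illustrative analysis: correlate backtracking frequency with accuracy.
# Marker phrases and the per-checkpoint numbers below are hypothetical.
import re
from statistics import correlation  # Pearson correlation (Python 3.10+)

BACKTRACK_MARKERS = re.compile(r"let's go back|wait, that|re-examine|backtrack", re.I)


def backtrack_count(cot: str) -> int:
    """Count explicit self-correction / backtracking phrases in one solution."""
    return len(BACKTRACK_MARKERS.findall(cot))


# Hypothetical per-checkpoint averages: backtracks per solution vs. benchmark accuracy.
avg_backtracks = [0.4, 1.1, 1.9, 2.7, 3.2]
accuracy = [0.52, 0.61, 0.70, 0.76, 0.80]
print(f"Pearson r = {correlation(avg_backtracks, accuracy):.2f}")
```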

Comparative insights and broader impact

Controlled experiments comparing ASTRO with models trained on direct CoT solutions (no search priors) show that, even when trained on the same problem sets and search trees, ASTRO consistently comes out ahead. For example, ASTRO-RL beats Direct-RL by:

  • +2% on MATH 500
  • +3.9% on AMC 2023
  • +2.9% on AIME 2024

In addition, ASTRO's outputs can be visualized as directed graphs, with nodes as reasoning steps and edges capturing transitions, reflections, and corrections, which makes the model's behavior easier to interpret.
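As a rough sketch of that view, the snippet below builds such a directed graph with networkx, using reasoning steps as nodes and a labeled edge for the backtracking move; the step labels are purely illustrative.

```python
# Illustrative sketch: an ASTRO-style trace viewed as a directed graph where
# nodes are reasoning steps and a labeled edge marks the backtracking move.
import networkx as nx

steps = [
    "set up equation",            # 0
    "try factoring",              # 1 (dead end)
    "reflect: factoring fails",   # 2
    "use quadratic formula",      # 3
    "final answer",               # 4
]

G = nx.DiGraph()
for i, text in enumerate(steps):
    G.add_node(i, text=text)

G.add_edges_from([(0, 1), (1, 2)])        # forward transitions into a dead end
G.add_edge(2, 0, kind="backtrack")        # correction returns to an earlier state
G.add_edges_from([(0, 3), (3, 4)])        # recovered path to the final answer

for u, v, data in G.edges(data=True):
    print(f"{steps[u]!r} -> {steps[v]!r} [{data.get('kind', 'transition')}]")
```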

ASTRO key takeaways

Conclusion

ASTRO shows that LLMs like Llama 3 can learn to reason more effectively, not through larger models or longer pre-training, but through principled post-training techniques. By imitating search algorithms in natural language, ASTRO enables models to think before answering, doubt their own steps, and self-correct mid-reasoning. This framework sets a new reference point for fine-tuning open LLMs to approach human-like reasoning through search-inspired behaviors.


Check out the Paper. All credit for this research goes to the researchers on this project. Also, feel free to follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.


