MiroMind-M1: Advancing Open-Source Mathematical Reasoning via Context-Aware Multi-Stage Reinforcement Learning

by Brenden Burgess


Large language models (LLMs) have recently demonstrated remarkable progress in multi-step reasoning, establishing mathematical problem solving as a rigorous benchmark for assessing advanced capabilities. While proprietary models like GPT-4o and Claude Sonnet 4 lead in performance, their closed-source nature hinders transparency and reproducibility. To address these shortcomings, MiroMind AI released the MiroMind-M1 series, a fully open-source pipeline (datasets, models, training code, and evaluation scripts) that sets new standards for openness and state-of-the-art mathematical reasoning within the Qwen-2.5 model ecosystem.

Architectural foundation and motivation

MiroMind-M1 is built on the robust Qwen-2.5 backbone, with enhancements aimed explicitly at mathematical reasoning. The team adopts a two-stage training protocol:

  1. Supervised fine-tuning (SFT): The model is fine-tuned on carefully curated and verified mathematical problems, equipping it with strong step-by-step reasoning capabilities.
  2. Reinforcement learning with verifiable rewards (RLVR): The model then undergoes RL on 62K challenging and rigorously verifiable mathematical problems, leveraging reward signals from a robust external verifier (a toy sketch of such a reward follows this list).
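
To ground the RLVR idea, here is a minimal sketch of a verifiable reward function. The extraction pattern and binary scoring are illustrative assumptions, far simpler than MiroMind's actual verifier:

```python
# Minimal sketch of a verifiable reward for math RLVR.
# Helper names and the \boxed{} convention are assumptions for illustration.

import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0
```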

This approach is motivated both by the need for strong mathematical logic and by lessons learned from leading reasoning models: imitating chain-of-thought examples improves general reasoning, while reinforcement learning, guided by precise rewards, further refines accuracy and efficiency.

Transparency and data quality

A hallmark of the MiroMind-M1 project is the complete openness and cleanliness of its training data:

  • SFT corpus composition: Drawn from OpenR1, Light-R1, and Synthetic-1, guaranteeing problems with verified solutions and rich multi-step reasoning traces.
  • Strict deduplication and decontamination: N-gram overlap filtering removes duplicates and prevents data leakage into evaluation sets (e.g., AIME24, AIME25, MATH500); see the sketch after this list.
  • Preference for long trajectories: Experiments show that training on samples with longer reasoning traces yields higher benchmark scores, underscoring the importance of deep semantic content in the reasoning signal.
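
As referenced above, here is a minimal sketch of n-gram overlap decontamination. The n-gram size and word-level tokenization are illustrative assumptions, not the project's actual settings:

```python
# Sketch of n-gram overlap decontamination between a training corpus and
# evaluation sets (e.g., AIME24/25, MATH500). n = 10 is an assumed value.

def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """Word-level n-grams of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_set: list[str], eval_set: list[str]) -> list[str]:
    """Drop any training sample sharing an n-gram with an eval problem."""
    eval_grams = set().union(*(ngrams(p) for p in eval_set))
    return [s for s in train_set if not (ngrams(s) & eval_grams)]
```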

The resulting dataset provides 719K verified training traces, significantly advancing open, reproducible research beyond previous efforts.

Supervised fine-tuning: empirical excellence

For SFT, MiroMind-SFT-7B is initialized from Qwen2.5-Math-7B and trained with a large context window (up to 32,768 tokens) and a no-packing strategy that avoids cross-sample attention contamination. Its performance on key mathematical benchmarks exceeds peer open models:

Model                  AIME24   AIME25   MATH500
DeepSeek-R1-Distill    55.5     40.4     92.8
MiMo-7B-SFT            58.7     44.3     93.0
MiroMind-SFT-7B        60.4     45.0     94.6

These results validate the effectiveness of the data curation and training design: richer, deeper samples and no-packing training lead to consistently higher performance.
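
To make the no-packing setup concrete, here is a hedged sketch using Hugging Face TRL. Exact argument names vary across TRL versions, and the output path and dataset are placeholders, not MiroMind's released configuration:

```python
# Hedged sketch of a no-packing SFT configuration with Hugging Face TRL.
# Model ID is the real Qwen2.5-Math-7B; everything else is illustrative.

from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="miromind-sft-7b-sketch",  # illustrative path
    max_seq_length=32768,                 # large context window, per the article
    packing=False,                        # keep samples separate so attention
                                          # never crosses sample boundaries
    bf16=True,
)

# trainer = SFTTrainer(model="Qwen/Qwen2.5-Math-7B", args=config,
#                      train_dataset=sft_dataset)  # dataset assumed available
# trainer.train()
```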

CAMPO: Context-Aware Multi-Stage Policy Optimization

A key innovation in MiroMind-M1's RLVR phase is the CAMPO algorithm. CAMPO addresses two critical RL challenges, training instability and token inefficiency, through:

  • Multi-stage training with expanding context limits: Training begins with constrained output lengths (e.g., 16K tokens), then gradually expands them to allow deeper reasoning, balancing efficiency and rigor.
  • Dynamic repetition penalty: A dedicated repetition critic penalizes outputs that repeat themselves early or excessively, preventing degenerate behavior and encouraging output diversity (a toy sketch follows this list).
  • Improved external verifier: The reward feedback system is significantly enhanced to robustly score mathematical answers (including tricky cases involving units, π, and percentages), ensuring that training signals align closely with true correctness.
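
Here is a toy sketch of the first two ingredients, the expanding length schedule and the repetition penalty. The stage boundaries, n-gram size, and penalty weight are illustrative assumptions, not CAMPO's actual hyperparameters:

```python
# Toy sketch of two CAMPO ingredients: a multi-stage max-length schedule
# and a simple repetition penalty. All numbers are illustrative assumptions.

def max_tokens_for_stage(stage: int) -> int:
    """Expanding output-length budget across RL training stages."""
    schedule = {0: 16_384, 1: 32_768}  # e.g., start at 16K, then expand
    return schedule.get(stage, 32_768)

def repetition_penalty(completion: str, n: int = 20, weight: float = 0.5) -> float:
    """Penalize completions whose word-level n-grams repeat.

    Returns a non-positive value to be added to the verifier reward.
    """
    tokens = completion.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    repeated_fraction = 1.0 - len(set(grams)) / len(grams)
    return -weight * repeated_fraction

def shaped_reward(base_reward: float, completion: str) -> float:
    """Combine the verifier's reward with the repetition penalty."""
    return base_reward + repetition_penalty(completion)
```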

CAMPO not only stabilizes RL dynamics but also yields models that solve problems with fewer redundant tokens, accelerating inference and reducing cost without sacrificing accuracy.
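
To see why a robust verifier matters, note that "50%", "1/2", and "0.5" denote the same answer yet fail naive string comparison. Below is a minimal sketch of symbolic equivalence checking with SymPy; the normalization rules are assumptions and far simpler than the project's actual verifier:

```python
# Minimal sketch of symbolic answer equivalence, illustrating the kinds of
# tricky cases (percentages, pi, units) a robust math verifier must handle.

from sympy import simplify, sympify

def normalize(answer: str) -> str:
    """Very rough normalization: strip a few common units, expand '%'."""
    s = answer.strip().lower().replace("\\pi", "pi").replace("π", "pi")
    for unit in ("degrees", "cm", "km", "m"):  # illustrative unit list
        s = s.removesuffix(unit).strip()
    if s.endswith("%"):
        s = f"({s[:-1]})/100"
    return s

def answers_match(predicted: str, reference: str) -> bool:
    """True if both answers simplify to the same symbolic expression."""
    try:
        diff = sympify(normalize(predicted)) - sympify(normalize(reference))
        return simplify(diff) == 0
    except Exception:  # unparseable answers fall back to string comparison
        return normalize(predicted) == normalize(reference)

# e.g., answers_match("50%", "1/2") and answers_match("pi/2", "0.5*pi") are True
```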

Benchmark performance: state-of-the-art efficiency

MiroMind's open models achieve highly competitive or state-of-the-art results among open Qwen-2.5-based math models (7B/32B parameters):

Model              AIME24   AIME25   MATH500
DeepSeek-R1-7B     55.5     39.2     –
MiMo-7B-RL         68.2     55.4     95.8
Skywork-OR1-7B     72.2     54.6     –
MiroMind-RL-7B     73.4     57.8     96.7
Skywork-OR1-32B    77.1     68.2     97.5
MiroMind-RL-32B    77.5     65.6     96.4

Notably, the MiroMind-M1-RL models not only match or exceed the accuracy of their peers, but do so with greater token efficiency: thanks to CAMPO training, the 32B model produces shorter, more concise solutions without loss of accuracy.

Full-stack openness and reproducibility

Every component of the MiroMind-M1 stack is openly released:

  • Model weights (SFT and RL checkpoints at both 7B and 32B scales)
  • Datasets (719K SFT samples, 62K RLVR problems)
  • Training scripts (supporting multi-node distributed training on Ray)
  • Evaluation code (standardized scripts and benchmark configurations)

Researchers can reproduce, audit, and extend MiroMind-M1 from raw data to trained models, advancing reproducibility and accelerating new open research.

Conclusion

MiroMind-M1 demonstrates that with careful data curation, innovative RL algorithms (CAMPO), and radical transparency, open-source language models can rival proprietary systems in advanced mathematical reasoning. This project sets a new bar for reproducibility and collaborative progress in reasoning LLMs, offering both a high-quality resource and a robust platform for future innovation.


Check out the Paper, GitHub page, and model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our newsletter.



Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
