The challenge of multimodal reasoning
Recent breakthroughs in text-only language models, such as DeepSeek-R1, have shown that RL can help develop strong reasoning skills. Motivated by this, researchers have tried to apply the same RL techniques to MLLMs to improve their ability to reason over visual and textual inputs. However, these attempts have not been entirely successful; MLLMs still struggle with complex reasoning tasks. This suggests that simply reusing RL recipes from text-only models may not work well in multimodal settings, where the interaction between different types of data introduces new challenges that require more tailored approaches.
Evolution of multimodal language models
Recent research on MLLMs builds on the progress of LLMs by combining visual inputs with language understanding. Early models, such as CLIP and MiniGPT-4, laid the foundations, followed by instruction-tuned models like LLaVA. While closed-source models show strong reasoning through long chain-of-thought (CoT) outputs, open-source models have mainly focused on fine-tuning and CoT adaptations. However, these often produce brief responses that limit in-depth reasoning. RL, including techniques like RLHF and GRPO, has proven promising for improving reasoning in LLMs. Inspired by this, recent work now aims to apply RL to MLLMs to improve visual reasoning and support richer, longer outputs.
Introduction of Revisual-R1
Researchers from Tsinghua University, Shanghai Jiao Tong University, and the Shanghai Artificial Intelligence Laboratory have introduced Revisual-R1, an open-source 7B-parameter MLLM that sets a new standard in multimodal reasoning. Their study reveals three key insights: (1) carefully curated text-only data provides a strong cold start, surpassing many existing MLLMs even before RL; (2) the commonly used GRPO algorithm suffers from gradient stagnation, which they address with a new method called Prioritized Advantage Distillation (PAD); and (3) adding a final text-only RL phase after multimodal RL further improves reasoning. Their three-stage approach, comprising text-only pre-training, multimodal RL, and a final text-only RL stage, strikes an effective balance between visual grounding and deep cognitive reasoning.
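The article describes PAD only at a high level, but the gradient-stagnation problem it targets is easy to illustrate: in GRPO, advantages are computed relative to a group of sampled responses, so when all responses in a group receive nearly identical rewards the advantages collapse toward zero and the update carries little signal. The sketch below shows this effect and one assumed way a prioritized re-weighting step could emphasize informative samples; the function names and the re-weighting rule are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std < 1e-8:
        # All responses scored alike -> zero advantages, i.e. almost no gradient signal.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / (std + 1e-8)

def prioritized_advantage_weights(advantages, temperature=1.0):
    """Illustrative 'prioritized' re-weighting: up-weight samples whose advantage
    magnitude is large, so near-zero-signal samples contribute less to the update.
    This is an assumed stand-in for the paper's PAD mechanism, not its exact rule."""
    priority = np.abs(advantages) ** temperature
    if priority.sum() == 0:
        return np.ones_like(advantages) / len(advantages)
    return priority / priority.sum()

# Example: a group of 4 sampled responses, three of which earn the same reward.
rewards = [1.0, 1.0, 1.0, 0.0]
adv = group_relative_advantages(rewards)
weights = prioritized_advantage_weights(adv)
print(adv, weights)  # the lone low-reward sample receives the largest weight
```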
Developing the GRAMMAR dataset
The GRAMMAR dataset was developed after the authors noticed that existing multimodal cold-start datasets lack the depth needed to train strong reasoning models. Text-only datasets, such as DeepMath, showed better gains on both text and multimodal tasks, suggesting that textual complexity is what best stimulates reasoning. To address this, GRAMMAR combines diverse textual and multimodal samples through a multi-stage curation process. This data feeds the Staged Reinforcement Optimization (SRO) framework, which first trains models with multimodal RL, enhanced by Prioritized Advantage Distillation to avoid stalled learning and by an efficient-length reward to curb verbosity, followed by a text-only RL phase to boost reasoning and language control.
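The exact form of the efficient-length reward is not given in this write-up. As a minimal sketch of the general idea, the reward for a response can be discounted once it exceeds a token budget, so verbose answers earn less than equally correct concise ones. The budget and penalty scale below are assumed placeholder values, not figures reported by the authors.

```python
def length_shaped_reward(task_reward: float,
                         num_tokens: int,
                         target_tokens: int = 1024,
                         penalty_scale: float = 0.5) -> float:
    """Combine a task reward with a soft penalty for overly long responses.

    Responses within the token budget keep their full reward; longer ones are
    discounted linearly, with the penalty capped at `penalty_scale`.
    The specific budget and penalty values here are illustrative assumptions.
    """
    overshoot = max(0, num_tokens - target_tokens)
    penalty = penalty_scale * min(1.0, overshoot / target_tokens)
    return task_reward - penalty

# A correct but verbose answer earns less than a correct concise one.
print(length_shaped_reward(1.0, num_tokens=800))   # 1.0
print(length_shaped_reward(1.0, num_tokens=2048))  # 0.5
```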
Three-stage training pipeline
Revisual-R1's experiments followed a three-stage training process: starting with pure text data to build a linguistic foundation, then incorporating multimodal reinforcement learning for visual-text reasoning, and finally fine-tuning with text-only RL to refine reasoning and fluency. The model was tested on a range of benchmarks and outperformed both open-source and some commercial models on multimodal and mathematical reasoning tasks, achieving the best results on nine out of ten benchmarks. Ablation studies confirmed the importance of the training order and of the Prioritized Advantage Distillation method, which focused learning on high-quality responses and led to a significant improvement in overall performance.
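To make the staged curriculum concrete, here is a minimal configuration-style sketch of the three stages described above: a text-only cold start, multimodal RL, then text-only RL. The dataset identifiers, stage names, and the choice of SFT for the cold start are assumptions for illustration, not settings reported by the authors.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Stage:
    name: str
    data: List[str]   # dataset identifiers (the GRAMMAR split names are assumed)
    method: str       # "sft" for the supervised cold start, "rl" for GRPO + PAD
    text_only: bool

# Illustrative pipeline mirroring the three stages described in the article.
pipeline = [
    Stage("cold_start",    data=["grammar_text"],       method="sft", text_only=True),
    Stage("multimodal_rl", data=["grammar_multimodal"], method="rl",  text_only=False),
    Stage("text_rl",       data=["grammar_text"],       method="rl",  text_only=True),
]

for stage in pipeline:
    modality = "text-only" if stage.text_only else "text+image"
    print(f"{stage.name}: {stage.method.upper()} on {stage.data} ({modality})")
```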

Summary and contributions
In conclusion, Revisual-R1 is an open-source 7B MLLM built to meet the challenges of complex multimodal reasoning. Instead of relying on scale alone, it uses a well-designed three-stage training process: starting with high-quality text data for foundational reasoning, followed by a multimodal RL phase enhanced with the new PAD technique for stability, and ending with a final text-only RL refinement. This deliberate curriculum considerably boosts performance. Revisual-R1 sets a new benchmark among 7B models, excelling on tasks like MathVerse and AIME. The work underlines how structured training can unlock deeper reasoning in MLLMs.
Check out the Paper and GitHub page. All credit for this research goes to the researchers on this project. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, Sana brings a fresh perspective to the intersection of AI and real-life solutions.
