ReVisual-R1: An Open-Source Multimodal Large Language Model (MLLM) That Achieves Long, Accurate, and Thoughtful Reasoning

by Brenden Burgess


The challenge of multimodal reasoning

Recent breakthroughs in text-only language models, such as DeepSeek-R1, have shown that RL can help develop strong reasoning skills. Motivated by this, researchers have tried to apply the same RL techniques to MLLMs to improve their ability to reason over visual and textual inputs. However, these attempts have not been entirely successful; MLLMs still struggle with complex reasoning tasks. This suggests that simply reusing RL strategies from text-only models may not work well in multimodal settings, where the interaction between different data types introduces new challenges that call for more tailored approaches.

Evolution of multimodal language models

Recent research on MLLMs builds on the progress of LLMs by combining visual inputs with language understanding. Early models such as CLIP and MiniGPT-4 laid the foundations, followed by instruction-tuned models such as LLaVA. While closed-source models demonstrate strong reasoning through long chain-of-thought (CoT) outputs, open-source models have mainly focused on fine-tuning and CoT adaptations. However, these often yield brief answers that limit in-depth reasoning. RL, including techniques such as RLHF and GRPO, has proven promising for improving reasoning in LLMs. Inspired by this, recent work now aims to apply RL to MLLMs to improve visual reasoning and elicit richer, longer outputs.

Introducing ReVisual-R1

Researchers from Tsinghua University, Shanghai Jiao Tong University, and the Shanghai Artificial Intelligence Laboratory have introduced ReVisual-R1, an open-source 7B-parameter MLLM that sets a new standard in multimodal reasoning. Their study yields three key insights: (1) a carefully curated text-only cold-start dataset provides a strong foundation, outperforming many existing MLLMs even before RL; (2) the commonly used GRPO algorithm suffers from gradient stagnation, which they address with a new method called Prioritized Advantage Distillation (PAD); and (3) adding a final text-only RL phase after multimodal RL further improves reasoning. Their three-stage approach, comprising text pre-training, multimodal RL, and final text-only RL, strikes an effective balance between visual grounding and deep cognitive reasoning.
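To make the gradient-stagnation problem concrete: GRPO normalizes each rollout's reward against its group, so a group whose rewards are all identical yields zero advantage and contributes no learning signal. The sketch below illustrates that effect and one plausible reading of PAD as prioritizing rollouts with large-magnitude advantages; the function names and selection rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: normalize each
    rollout's reward by the group mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0:
        # All rewards equal: zero advantage for every rollout,
        # hence no gradient signal -- the stagnation PAD targets.
        return np.zeros_like(r)
    return (r - r.mean()) / std

def pad_select(groups, k):
    """Toy prioritized selection: keep the k rollouts with the
    largest |advantage| across groups, dropping near-zero-gradient
    samples from the update (illustrative sketch of PAD's idea)."""
    scored = []
    for rewards, rollout_ids in groups:
        adv = grpo_advantages(rewards)
        scored.extend(zip(np.abs(adv), rollout_ids, adv))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(rid, a) for _, rid, a in scored[:k]]
```

In this toy form, a group where every rollout got the same reward is filtered out entirely, while informative groups dominate the batch.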

Developing the GRAMMAR Dataset

The GRAMMAR dataset was developed after the authors noticed that existing multimodal cold-start datasets lack the depth needed to train strong reasoning models. Text-only datasets, such as DeepMath, showed better gains on both text and multimodal tasks, suggesting that textual complexity is a stronger driver of reasoning. To address this, GRAMMAR combines diverse textual and multimodal samples through a multi-stage curation process. This data feeds the Staged Reinforcement Optimization (SRO) framework, which first trains models with multimodal RL, enhanced by Prioritized Advantage Distillation to avoid stalled learning and an efficient-length reward to curb verbosity, followed by a text-only RL phase to boost reasoning and language fluency.
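A length-based reward for curbing verbosity can be sketched simply: no penalty up to a target length, then a growing penalty toward a hard cap. This is a minimal illustration of the idea behind an efficient-length reward; the shape, threshold, and cap values here are assumptions, not the paper's specification.

```python
def length_reward(n_tokens, target=2048, cap=4096):
    """Toy efficient-length reward: 0 up to `target` tokens, then a
    linear penalty reaching -1.0 at `cap` (values are illustrative)."""
    if n_tokens <= target:
        return 0.0
    return -min(1.0, (n_tokens - target) / (cap - target))
```

In an RL loop this term would be added to the task reward, so concise correct answers score higher than rambling correct ones without punishing normal-length reasoning.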

Three-Stage Training Pipeline

Experiments with ReVisual-R1 followed a training process structured in three stages: starting with pure text data to build a linguistic foundation, then incorporating multimodal reinforcement learning for visual-text reasoning, and finally fine-tuning with text-only RL to refine reasoning and fluency. The model was evaluated on various benchmarks and outperformed both open-source and some commercial models on multimodal and mathematical reasoning tasks, achieving the best results on 9 out of 10 benchmarks. Ablation studies confirmed the importance of the training order and of Prioritized Advantage Distillation, which helped focus learning on high-quality responses, leading to a significant improvement in overall performance.
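The three-stage curriculum above can be written down as a plain schedule, which also makes the ablations on training order easy to express (swap or drop stages). The dataset filenames and method labels below are placeholders for illustration, not artifacts from the paper's release.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str    # human-readable stage name
    data: str    # placeholder path for the stage's training data
    method: str  # training method used in that stage

# Sketch of the three-stage curriculum described above.
PIPELINE = [
    Stage("text_cold_start", "text_only_cot.jsonl", "SFT"),
    Stage("multimodal_rl", "multimodal_mix.jsonl", "GRPO+PAD"),
    Stage("text_rl", "text_only_rl.jsonl", "GRPO"),
]

def run(pipeline):
    for stage in pipeline:
        print(f"[{stage.name}] {stage.method} on {stage.data}")

run(PIPELINE)
```

An order-ablation run would simply pass a reordered list to `run`, mirroring the study's finding that the text-first ordering matters.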

Summary and Contributions

In conclusion, ReVisual-R1 is an open-source 7B MLLM built to meet the challenges of complex multimodal reasoning. Instead of relying on scale alone, it uses a well-designed three-stage training process: starting with high-quality text data for foundational reasoning, followed by a multimodal RL phase enhanced with the new PAD technique for stability, and ending with a final text-based RL refinement. This thoughtful curriculum considerably boosts performance. ReVisual-R1 sets a new benchmark among 7B models, excelling on tasks such as MathVerse and AIME. The work underscores how structured training can unlock deeper reasoning in MLLMs.


Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

