Vision-language models (VLMs) have become fundamental components of multimodal AI systems, allowing autonomous agents to understand visual environments, reason over multimodal content, and interact with both digital and physical worlds. The importance of these capabilities has driven extensive research into architectural designs and training methodologies, resulting in rapid progress in the field. Xiaomi researchers introduce MiMo-VL-7B, a compact yet powerful VLM comprising three key components: a native-resolution Vision Transformer encoder that preserves fine-grained visual detail, a multi-layer perceptron (MLP) projector for efficient cross-modal alignment, and the MiMo-7B language model, optimized for complex reasoning tasks.
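The sketch below illustrates how these three components could be composed in a single forward pass; the class name, dimensions, and the way visual tokens are concatenated with text embeddings are illustrative assumptions, not MiMo-VL-7B's actual implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Illustrative composition: ViT encoder -> MLP projector -> LLM backbone."""

    def __init__(self, vit, llm, vit_dim=1152, llm_dim=4096):
        super().__init__()
        self.vit = vit                     # native-resolution vision transformer (assumed interface)
        self.projector = nn.Sequential(    # randomly initialized MLP projector
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                     # MiMo-7B-style language model (assumed interface)

    def forward(self, pixel_values, text_embeds):
        vis_tokens = self.vit(pixel_values)          # (batch, num_image_tokens, vit_dim)
        vis_tokens = self.projector(vis_tokens)      # map into the LLM's latent space
        # Prepend projected visual tokens to the text embeddings before the LLM
        fused = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=fused)
```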
MiMo-VL-7B undergoes two sequential training processes. The first is a four-stage pre-training phase, comprising projector warm-up, vision-language alignment, general multimodal pre-training, and long-context supervised fine-tuning, which consumes 2.4 trillion tokens drawn from curated, high-quality datasets. This yields the MiMo-VL-7B-SFT model. The second is a post-training phase that introduces Mixed On-policy Reinforcement Learning (MORL), integrating diverse reward signals covering perception accuracy, visual grounding precision, logical reasoning capabilities, and human preferences. This yields the MiMo-VL-7B-RL model. The main findings show that incorporating high-quality, broad-coverage reasoning data from the pre-training stage onward improves model performance, while achieving stable simultaneous improvements across capabilities remains challenging.
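One way to picture the two processes is as an ordered pipeline, as in the sketch below; the stage identifiers mirror the description above, while the lists of trainable modules per stage are assumptions for illustration rather than the report's exact recipe.

```python
# Hypothetical outline of the MiMo-VL-7B training pipeline described above.
PRETRAINING_STAGES = [
    {"stage": "projector_warmup",             "trainable": ["projector"]},
    {"stage": "vision_language_alignment",    "trainable": ["projector", "vit"]},
    {"stage": "general_multimodal_pretraining", "trainable": ["projector", "vit", "llm"]},
    {"stage": "long_context_sft",             "trainable": ["projector", "vit", "llm"]},
]   # together these stages consume ~2.4T curated tokens -> MiMo-VL-7B-SFT

POST_TRAINING = {
    "method": "MORL",   # mixed on-policy reinforcement learning
    "reward_signals": ["perception_accuracy", "visual_grounding",
                       "logical_reasoning", "human_preference"],
}   # applied on top of the SFT checkpoint -> MiMo-VL-7B-RL
```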
The MiMo-VL-7B architecture contains three components: (a) a Vision Transformer (ViT) to encode visual inputs such as images and videos, (b) a projector that maps the visual encodings into a latent space aligned with the LLM, and (c) the LLM itself, which performs textual understanding and reasoning. Qwen2.5-ViT is adopted as the visual encoder to support native-resolution inputs. The LLM backbone uses the MiMo-7B base for its strong reasoning capability, and a randomly initialized multi-layer perceptron (MLP) serves as the projector. The model's pre-training dataset comprises 2.4 trillion tokens of diverse multimodal data: image captions, interleaved image-text data, optical character recognition (OCR) data, grounding data, video content, GUI interactions, reasoning examples, and text-only sequences.
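As a rough illustration of how such a mixture might be consumed during pre-training, the sketch below samples a data category per training example; the category names come from the text, but the uniform weights and the sampler itself are placeholders, not the actual data pipeline.

```python
import random

# Categories in the 2.4-trillion-token pre-training mixture named in the text.
# The weights below are placeholders; the report's sampling ratios are not
# given in this summary.
DATA_MIXTURE = {
    "image_captions": 1.0,
    "interleaved_image_text": 1.0,
    "ocr": 1.0,
    "grounding": 1.0,
    "video": 1.0,
    "gui_interactions": 1.0,
    "reasoning_examples": 1.0,
    "text_only": 1.0,
}

def sample_category(mixture=DATA_MIXTURE):
    """Pick the next training example's category in proportion to its weight."""
    categories, weights = zip(*mixture.items())
    return random.choices(categories, weights=weights, k=1)[0]
```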
The post-training phase further improves MiMo-VL-7B on challenging reasoning tasks and on alignment with human preferences by using the MORL framework, which seamlessly integrates reinforcement learning with verifiable rewards (RLVR) and reinforcement learning from human feedback (RLHF). RLVR uses rule-based reward functions for continuous self-improvement, so several verifiable reasoning and perception tasks are designed such that the final answer can be validated precisely with predefined rules. RLHF is employed alongside these verifiable rewards to address human-preference alignment and mitigate undesirable behaviors. Finally, MORL is implemented to optimize the RLVR and RLHF objectives simultaneously.
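A minimal sketch of how a rule-based verifiable reward and a learned preference reward might be combined into one scalar per sample under a MORL-style setup is shown below; the function names, the `reward_model.score` interface, and the weighting scheme are assumptions for illustration, not the paper's formulation.

```python
def rule_based_reward(sample):
    """RLVR-style reward: check the final answer against a predefined rule.

    For a math task this could be exact match on the final answer; for a
    grounding task, an IoU threshold on a predicted box (illustrative only).
    """
    return 1.0 if sample["prediction"] == sample["reference"] else 0.0


def preference_reward(sample, reward_model):
    """RLHF-style reward: score from a learned human-preference reward model
    (the .score interface is an assumption)."""
    return reward_model.score(sample["prompt"], sample["prediction"])


def mixed_reward(sample, reward_model, w_verifiable=1.0, w_preference=1.0):
    """Combine both signals so the RLVR and RLHF objectives are optimized together.

    Whether a sample carries a verifiable reference decides which term applies;
    the weights are placeholders.
    """
    if sample.get("reference") is not None:   # verifiable reasoning/perception task
        return w_verifiable * rule_based_reward(sample)
    return w_preference * preference_reward(sample, reward_model)
```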
A comprehensive evaluation across 50 tasks demonstrates MiMo-VL-7B's state-of-the-art performance among open-source models. On general vision tasks, the models achieve exceptional results, with MiMo-VL-7B-SFT and MiMo-VL-7B-RL obtaining 64.6% and 66.7% on MMMU-val, respectively, outperforming larger models such as Gemma 3 27B. For document understanding, MiMo-VL-7B-RL excels with 56.5% on CharXiv-RQ, exceeding Qwen2.5-VL by 14.0 points and InternVL3 by 18.9 points. On multimodal reasoning tasks, both the RL and SFT models significantly outperform open-source baselines, with MiMo-VL-7B-SFT even surpassing much larger models, including Qwen2.5-VL-72B and QVQ-72B-Preview. The RL variant achieves further improvements, raising MathVision accuracy from 57.9% to 60.4%.
MiMo-VL-7B shows exceptional GUI understanding and grounding capabilities, with the RL model outperforming all compared general-purpose VLMs and matching or exceeding GUI-specialized models on challenging benchmarks such as ScreenSpot-Pro and OSWorld-G. The model achieves the highest Elo rating among all evaluated open-source VLMs, ranking first across models spanning 7B to 72B parameters and approaching proprietary models such as Claude 3.7 Sonnet. MORL provides a significant boost of 22+ points over the SFT model, validating the effectiveness of the training methodology and highlighting the competitiveness of this general-purpose VLM approach.
In conclusion, the researchers introduced the MiMo-VL-7B models, which achieve state-of-the-art performance through curated, high-quality pre-training data and the MORL framework. Key development insights include consistent performance gains from integrating reasoning data into the later pre-training stages, the advantages of on-policy RL over vanilla GRPO, and the challenges of task interference when applying MORL across diverse capabilities. The researchers open-source the complete evaluation suite to promote transparency and reproducibility in multimodal research. This work advances capable open-source vision-language models and provides valuable insights to the community.
Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95K+ ML SubReddit and subscribe to our Newsletter.
Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
