Multimodal LLMs: Expanding Capabilities Across Text and Vision
Extending large language models (LLMs) to handle multiple modalities, particularly images alongside text, has enabled the development of more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visuals, answer questions about images, and engage in dialogues that involve both text and images. Their ability to reason across visual and linguistic domains makes them increasingly valuable for applications such as education, content generation, and interactive assistants.
The Challenge of Text-Only Forgetting in MLLMs
However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose their ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because the visual tokens inserted into the language sequence divert the model's attention away from the text. As a result, the MLLM begins to prioritize image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, comprehension, or text-only question answering (QA). A rough way to see this attention shift is sketched below.
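The diagnosis here is about where attention mass goes once image tokens are interleaved with text. As a minimal illustration (not the paper's exact analysis), a probe like the following measures what fraction of text-token attention lands on visual tokens in one decoder layer; the function name, tensor shapes, and single-layer view are assumptions for this sketch:

```python
import torch

def text_to_image_attention_share(attn_weights: torch.Tensor,
                                  image_token_mask: torch.Tensor) -> float:
    """Fraction of attention mass that text-token queries place on image-token keys.

    attn_weights: (heads, seq, seq) attention matrix from one decoder layer.
    image_token_mask: (seq,) boolean mask marking the inserted visual tokens.
    Illustrative probe only; shapes and naming are assumptions.
    """
    text_queries = ~image_token_mask
    # Attention from text-token queries (dim 1) onto image-token keys (dim 2).
    to_image = attn_weights[:, text_queries][:, :, image_token_mask].sum()
    total = attn_weights[:, text_queries].sum()
    return (to_image / total).item()
```

A rising share across training steps would indicate the kind of attention drift toward visual tokens that the paper associates with degraded text-only performance.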
Limitations of Existing Mitigation Strategies
Several methods attempt to address this degradation. Some approaches reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language capabilities. Other designs add adapter layers or prompt-based tuning. However, these techniques often increase training costs, require complex switching logic at inference time, or fail to fully restore text understanding. The problem stems largely from how the model's attention shifts once image tokens are introduced into the sequence.
Introducing WINGS: A Dual-Learner Approach from Alibaba and Nanjing University
Researchers from Alibaba Group's AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules, visual and textual learners, into each layer of the MLLM. These learners operate in parallel with the model's main attention mechanism, and the structure resembles "wings" attached to either side of the attention layers. A routing component controls how much weight each learner receives based on the current token mix, allowing the model to dynamically balance its focus between visual and textual information. A simplified sketch of this layout follows.
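To make the layout concrete, here is a minimal PyTorch sketch of a decoder layer with two side "wings" and a per-token router, assuming a simplified single-tensor layer interface. Class names, the low-rank shape of the learners, and the softmax router are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class WingLearner(nn.Module):
    """Lightweight side module ("wing") attached next to a layer's main attention path."""
    def __init__(self, hidden_dim: int, rank: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, rank, bias=False)  # project down to low rank
        self.up = nn.Linear(rank, hidden_dim, bias=False)    # project back to model width

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(hidden_states))

class WingedLayer(nn.Module):
    """Wraps a decoder layer with parallel visual/textual learners and a router."""
    def __init__(self, base_layer: nn.Module, hidden_dim: int):
        super().__init__()
        self.base_layer = base_layer                   # original attention + MLP block
        self.visual_learner = WingLearner(hidden_dim)
        self.textual_learner = WingLearner(hidden_dim)
        self.router = nn.Linear(hidden_dim, 2)         # per-token weights for the two wings

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        main_out = self.base_layer(hidden_states)
        weights = torch.softmax(self.router(hidden_states), dim=-1)   # (batch, seq, 2)
        side = (weights[..., 0:1] * self.visual_learner(hidden_states)
                + weights[..., 1:2] * self.textual_learner(hidden_states))
        return main_out + side                         # residual combination with the main path
```

The key design point is that the base layer is left intact; the wings only add a residual, routed correction on top of it, so text-only behavior is not overwritten.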
Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness
The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computation lightweight while allowing the learners to capture essential modality-specific information. In the first stage of training, only the visual learners are activated to align image features. In the second stage, both visual and textual learners are co-trained alongside a router module that uses attention weights to allocate responsibility. Each learner uses efficient attention blocks to interact with the surrounding image or text, and their outputs are combined with those of the main model. This ensures that visual attention does not overwhelm textual understanding. A hedged sketch of both ideas is given below.
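The following sketch shows one way a low-rank residual attention block and the two-stage schedule could look, continuing the hypothetical `WingedLayer` structure above. The single-head simplification, rank size, and `set_stage` helper are assumptions for illustration, not the paper's exact code:

```python
import torch
import torch.nn as nn

class LoRRABlock(nn.Module):
    """Low-rank residual attention: queries come from the layer's hidden states,
    keys/values from the surrounding visual or textual tokens."""
    def __init__(self, hidden_dim: int, rank: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.k_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.v_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.out_proj = nn.Linear(rank, hidden_dim, bias=False)
        self.scale = rank ** -0.5

    def forward(self, hidden_states: torch.Tensor,
                modality_tokens: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); modality_tokens: image or text features
        q = self.q_proj(hidden_states)
        k = self.k_proj(modality_tokens)
        v = self.v_proj(modality_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out_proj(attn @ v)   # added residually to the main attention output

def set_stage(layer, stage: int) -> None:
    """Illustrative two-stage schedule: stage 1 trains only the visual learner to
    align image features; stage 2 co-trains both learners plus the router."""
    for p in layer.visual_learner.parameters():
        p.requires_grad = True
    for p in layer.textual_learner.parameters():
        p.requires_grad = (stage == 2)
    for p in layer.router.parameters():
        p.requires_grad = (stage == 2)
```

Because the projections are low rank, each learner adds only a small number of parameters per layer, which is how the design keeps the extra modality-specific attention inexpensive.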
WINGS' Performance Across Text-Only and Multimodal Benchmarks
In terms of performance, WINGS showed strong results. On the MMLU dataset, it achieved a text-only score of 60.53, an improvement of 9.70 points over a comparable baseline model. On CMMLU, it scored 69.82, 9.36 points higher than the baseline. On reasoning tasks such as RACE-High it gained 11.9 points, and on WSC it recorded an improvement of 11.12 points. On multimodal benchmarks such as MMMU-VAL, WINGS achieved an improvement of 4.78 points. It also demonstrated robust results on the IIT benchmark, handling mixed text-and-image multi-turn dialogues more effectively than other open-source MLLMs of the same scale.
Conclusion: Toward More Balanced and Generalizable MLLMs
In summary, the researchers addressed the problem of catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners with attention-based routing. By analyzing attention shifts and designing targeted interventions, they preserved text performance while improving visual understanding, offering a more balanced and efficient multimodal model.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
