Multimodal LLMs without compromise: researchers from UCLA, UW–Madison, and Adobe introduce X-Fusion to add vision to frozen language models without losing language skills

by Brenden Burgess


LLMs have made significant progress on language-related tasks such as conversational AI, reasoning, and code generation. However, human communication extends beyond text, often incorporating visual elements to improve understanding. To create a truly versatile AI, models need to process text and visual information simultaneously. Training unified vision-language models from scratch, using methods such as autoregressive token prediction or a hybrid approach combining diffusion and language losses, has shown strong performance. However, it requires large computational resources and retraining for each new modality. An alternative approach adapts a pretrained LLM with vision capabilities, which offers a more efficient path but often compromises the language model's original performance.

Current research has focused on three main strategies: fusing LLMs with standalone image generation models, training large multimodal models end to end, or using a combination of diffusion and autoregressive losses. Although these methods have achieved state-of-the-art results, they either require retraining large models or degrade the LLM's core capabilities. Despite these challenges, leveraging pretrained LLMs with added vision components has demonstrated significant potential, especially on tasks involving image understanding and generation. However, these methods still face limitations in terms of efficiency and flexibility.

Researchers from UCLA, the University of Wisconsin-Madison, and Adobe Research propose X-Fusion, which adapts pretrained LLMs to multimodal tasks while preserving their language capabilities. X-Fusion uses a dual-tower architecture, freezing the LLM's language weights while adding a separate tower dedicated to visual information. The approach aligns text and vision features at multiple levels, improving performance on both text-to-image and image-to-text tasks. Through ablation studies, the researchers highlight the importance of clean image data for training and show that aligning vision features with pretrained representations accelerates convergence, particularly for smaller models.
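To make the frozen-backbone idea concrete, here is a minimal PyTorch sketch (not the authors' code) of how one might freeze a pretrained language tower while leaving a newly added vision tower trainable. The module names `language_model` and `vision_tower` are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DualTowerModel(nn.Module):
    """Illustrative dual-tower wrapper: frozen language tower, trainable vision tower."""

    def __init__(self, language_model: nn.Module, vision_tower: nn.Module):
        super().__init__()
        self.language_model = language_model  # pretrained LLM, kept frozen
        self.vision_tower = vision_tower      # new tower, trained from scratch

        # Freeze every language-tower parameter so language skills are preserved.
        for param in self.language_model.parameters():
            param.requires_grad = False

# Only the vision tower's parameters are handed to the optimizer:
# model = DualTowerModel(pretrained_llm, new_vision_tower)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```

Because the optimizer only ever sees trainable parameters, the language weights are guaranteed to be byte-identical after multimodal training, which is the property that protects the original language skills.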

X-Fusion is a unified framework that adapts pretrained LLMs to vision tasks while retaining their language capabilities. It uses a dual-tower design, freezing the LLM's text weights while introducing a separate vision tower to process visual information. Images are tokenized with a pretrained encoder, and image and text tokens are optimized jointly. The model incorporates an optional X-Fuse operation that merges features from the two towers for improved performance. X-Fusion is trained with both autoregressive and image diffusion losses, and its performance is evaluated on image generation (text-to-image) and image understanding (image-to-text) tasks.
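As a rough illustration of the joint objective described above, the sketch below combines an autoregressive cross-entropy loss on text tokens with a diffusion-style denoising (MSE) loss on image noise predictions. The tensor shapes and the weighting term `lambda_img` are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def joint_loss(text_logits, text_targets, noise_pred, noise_true, lambda_img=1.0):
    """Combined objective: next-token prediction for text, denoising for images.

    text_logits: (batch, seq_len, vocab) predictions over text tokens
    text_targets: (batch, seq_len) ground-truth token ids
    noise_pred / noise_true: predicted vs. sampled diffusion noise tensors
    """
    # Autoregressive loss: predict each text token from its prefix.
    ar_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Diffusion loss: mean-squared error between predicted and true noise.
    diff_loss = F.mse_loss(noise_pred, noise_true)
    return ar_loss + lambda_img * diff_loss
```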

The study evaluates the dual-tower architecture against alternative transformer variants for multimodal integration. It compares single-tower, gated-tower, and dual-projection designs, highlighting the dual tower's flexibility across image and text tasks. The dual tower performs best on both image generation and understanding, outperforming the other designs by 23% in FID without increasing the number of training parameters. The study also examines the effects of noisy data and data ratios on performance, finding that clean images improve both understanding and generation. In addition, aligning vision features with a pretrained encoder improves performance, especially for smaller models.
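One plausible way to implement the feature-alignment idea mentioned above, sketched here under our own assumptions rather than the paper's exact recipe, is to penalize the cosine distance between the vision tower's intermediate features and those of a frozen pretrained image encoder:

```python
import torch
import torch.nn.functional as F

def alignment_loss(tower_feats: torch.Tensor, encoder_feats: torch.Tensor) -> torch.Tensor:
    """Encourage vision-tower features to match a frozen pretrained encoder.

    Both inputs are (batch, num_patches, dim); encoder_feats should be
    computed under torch.no_grad() so only the vision tower is updated.
    """
    # Cosine distance per patch, averaged over the batch.
    cos = F.cosine_similarity(tower_feats, encoder_feats, dim=-1)
    return (1.0 - cos).mean()
```

Added as a regularizer to the main objective, a term like this pulls the new tower toward representations that are already well-trained, which is consistent with the reported faster convergence for smaller models.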

In conclusion, X-Fusion is a framework that adapts pretrained LLMs to multimodal tasks, such as image understanding and generation, while preserving language capabilities. It introduces a dual-tower architecture in which the language weights remain fixed while a separate trainable vision tower processes visual features. Experimental results show that X-Fusion outperforms alternative designs on both image-to-text and text-to-image tasks. Key findings include the benefits of incorporating understanding-focused data, reducing noise in image data, and the positive impact of feature alignment, particularly for smaller models. The research provides valuable insights into building efficient multimodal models.


Check out the Paper. Also, don't forget to follow us on Twitter.



Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, Sana brings a fresh perspective to the intersection of AI and real-life solutions.
