New from the Chinese Academy of Sciences: Stream-Omni, an LLM for real-time cross-modal AI

by Brenden Burgess


Understanding the limits of current omni-modal architectures

Large multimodal models (LMMs) have shown impressive omni-modal capabilities across text, vision, and speech, opening up a wide range of applications. While vision-oriented LMMs have been successful, omni-modal LMMs that support speech interaction grounded in visual information still face challenges due to the intrinsic representation differences between modalities. Recent omni-modal LMMs aim to unify text, vision, and speech by concatenating the representations of individual modalities along the sequence dimension. However, they rely on large-scale data to learn modality alignments in a data-driven manner, which clashes with the limited availability of public tri-modal datasets and leaves little flexibility to produce intermediate text results during speech interactions.
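
To make the concatenation-based approach concrete, here is a minimal sketch (with assumed tensor shapes and no real encoders) of how such models fuse per-modality representations along the sequence dimension before passing them to the LLM:

```python
# Minimal sketch of sequence-dimension concatenation in omni-modal LMMs.
# Shapes and dimensions are illustrative assumptions, not the paper's values.
import torch

batch, d_model = 1, 4096
vision_tokens = torch.randn(batch, 576, d_model)   # e.g. patch embeddings from a vision encoder
speech_tokens = torch.randn(batch, 1500, d_model)  # e.g. frame embeddings from a speech encoder
text_tokens = torch.randn(batch, 32, d_model)      # embedded text prompt

# All modalities are stacked along the sequence axis; the LLM must learn
# cross-modal alignment purely from data, hence the need for large tri-modal corpora.
llm_input = torch.cat([vision_tokens, speech_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 2108, 4096])
```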

Categorizing existing LMMs by modal focus

Current LMMs fall into three categories: vision-oriented, speech-oriented, and omni-modal. Vision-oriented LMMs such as LLaVA use vision encoders to extract visual features, which are combined with textual inputs and fed into the LLM to generate text. Speech-oriented LMMs either use continuous representations, as in Mini-Omni and LLaMA-Omni, projecting speech features into the LLM embedding space, or discrete speech units, as in SpeechGPT and Moshi, converting speech into discrete units for direct LLM processing. Omni-modal LMMs such as VITA-1.5, MiniCPM2.6-o, and Qwen2.5-Omni extract representations with various encoders, concatenate them for multimodal understanding, and use speech decoders for synthesis.
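
The two speech-oriented designs differ mainly in how speech reaches the LLM. The sketch below (hypothetical modules and dimensions, not any model's actual code) contrasts projecting continuous encoder features into the LLM embedding space with embedding discrete speech units as extra vocabulary tokens:

```python
# Illustrative contrast of the two speech-LMM input styles; all modules,
# shapes, and vocabulary sizes here are assumptions for demonstration.
import torch
import torch.nn as nn

d_speech, d_model, n_units = 1024, 4096, 1000
frames = 200

# (a) Continuous style (Mini-Omni / LLaMA-Omni-like): project speech encoder
# features directly into the LLM embedding space with a learned projector.
speech_features = torch.randn(1, frames, d_speech)
projector = nn.Linear(d_speech, d_model)
continuous_inputs = projector(speech_features)        # (1, 200, d_model), fed to the LLM

# (b) Discrete style (SpeechGPT / Moshi-like): quantize speech into unit IDs
# and embed them like additional vocabulary tokens.
unit_ids = torch.randint(0, n_units, (1, frames))
unit_embedding = nn.Embedding(n_units, d_model)
discrete_inputs = unit_embedding(unit_ids)            # (1, 200, d_model)
```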

Introducing Stream-Omni: a text-centric alignment approach

Researchers from the University of Chinese Academy of Sciences have proposed Stream-Omni, a large language-vision-speech model designed to address the challenge of modality alignment in omni-modal systems. It uses an LLM backbone and aligns the vision and speech modalities to text according to their semantic relationships, rather than relying on simple concatenation. For vision, the method applies sequence-dimension concatenation to align vision and text. For speech, it introduces a CTC-based layer-dimension mapping for speech-text alignment. This design overcomes the limitations of pure concatenation methods by introducing targeted alignment mechanisms.
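
The CTC-based speech-text mapping can be pictured with a rough sketch like the one below. This is not the authors' code: the head, shapes, and vocabulary size are assumptions, and it only shows how hidden states from the speech-side layers could be tied to the text transcript through a CTC objective:

```python
# Rough sketch of CTC-based speech-to-text alignment (assumed shapes/names).
import torch
import torch.nn as nn

vocab_size, d_model, frames = 32000, 4096, 200
speech_hidden = torch.randn(frames, 1, d_model)        # (time, batch, dim) from the speech layers
ctc_head = nn.Linear(d_model, vocab_size + 1)          # +1 for the CTC blank symbol

log_probs = ctc_head(speech_hidden).log_softmax(dim=-1)
text_targets = torch.randint(1, vocab_size, (1, 20))   # dummy tokenized transcript
loss = nn.CTCLoss(blank=vocab_size)(
    log_probs,
    text_targets,
    input_lengths=torch.tensor([frames]),
    target_lengths=torch.tensor([20]),
)
# Minimizing this loss encourages the speech hidden states to align, frame by
# frame, with the text tokens, letting text act as the pivot modality.
```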

Architecture overview: dual-layer speech integration and visual encoding

Stream-Omni's architecture uses an LLM backbone with progressive alignment strategies. For vision-text alignment, it applies a vision encoder and a projection layer to extract visual representations. For speech-text alignment, it introduces special speech layers at the bottom and top of the LLM backbone, enabling bidirectional mapping between the speech and text modalities. Stream-Omni builds its training corpus through automated pipelines, using LLaVA datasets for vision-text pairs, LibriSpeech and WenetSpeech for speech-text data, and the InstructOmni dataset, created by converting existing instruction data through speech synthesis.
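
A high-level module skeleton may help visualize this layout. The class below is purely illustrative: the module names, the HuggingFace-style `inputs_embeds`/`last_hidden_state` interface, and the use of plain Transformer layers as "speech layers" are all assumptions, and the CTC and decoding details are omitted:

```python
# Illustrative skeleton of the dual-path design described above; not the
# authors' implementation.
import torch
import torch.nn as nn

class StreamOmniSketch(nn.Module):
    def __init__(self, vision_encoder, speech_encoder, llm, d_vision, d_model):
        super().__init__()
        self.vision_encoder = vision_encoder                 # e.g. a ViT
        self.vision_proj = nn.Linear(d_vision, d_model)      # vision -> LLM embedding space
        self.speech_encoder = speech_encoder                 # e.g. a speech feature extractor
        self.bottom_speech_layers = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.top_speech_layers = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = llm                                       # text backbone

    def forward(self, image, speech, text_embeds):
        # Vision path: sequence-dimension concatenation with the text embeddings.
        vision_tokens = self.vision_proj(self.vision_encoder(image))
        # Speech path: bottom speech layers map speech features toward the text space.
        speech_tokens = self.bottom_speech_layers(self.speech_encoder(speech))
        llm_input = torch.cat([vision_tokens, speech_tokens, text_embeds], dim=1)
        hidden = self.llm(inputs_embeds=llm_input).last_hidden_state
        # Top speech layers map text-space hidden states back toward speech for synthesis.
        return self.top_speech_layers(hidden)
```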

Benchmarking multimodal capabilities across domains

In visual understanding tasks, Stream-Omni achieves performance comparable to advanced vision-oriented LMMs and surpasses VITA-1.5, reducing modality interference while maintaining strong visual capabilities. For speech interaction, Stream-Omni shows excellent knowledge-based performance while using far less speech data (23K hours) than discrete-speech-unit models such as SpeechGPT, Moshi, and GLM-4-Voice. In vision-grounded speech interaction evaluations on the SpokenVisIT benchmark, Stream-Omni outperforms VITA-1.5 in real-world visual understanding. The quality of its speech-text mapping also gives Stream-Omni superior ASR performance on the LibriSpeech benchmark, in both accuracy and inference time.

Conclusion: a paradigm shift in multimodal alignment

In conclusion, the researchers introduced Stream-Omni as a solution to the modality alignment challenges of omni-modal systems. The method shows that effective modality alignment can be achieved through sequence-dimension concatenation for vision-text pairs and layer-dimension mapping for speech-text integration, eliminating the need for extensive tri-modal training data. More broadly, this research establishes a new paradigm for omni-modal LMMs, showing that targeted alignment strategies grounded in semantic relationships can overcome the limitations of traditional concatenation-based approaches in multimodal AI systems.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
