Offline Video-LLMs Can Now Understand Real-Time Streams: Apple Researchers Introduce StreamBridge to Enable Multi-Turn and Proactive Video Understanding

by Brenden Burgess


Video-LLMs process entire pre-recorded videos at once. However, applications such as robotics and autonomous driving require causal perception and interpretation of visual information online. This mismatch exposes a fundamental limitation of current Video-LLMs: they are not naturally designed to operate in streaming scenarios, where timely understanding and responsiveness are essential. The shift from offline to streaming video understanding poses two key challenges. First, multi-turn real-time understanding requires models to process the most recent video segment while retaining historical visual and conversational context. Second, proactive response generation demands human-like behavior, where the model actively monitors the visual stream and delivers timely outputs based on the unfolding content, without explicit prompts.
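To make the first challenge concrete, here is a minimal sketch of a multi-turn streaming loop: each turn encodes only the newest video segment, while earlier visual tokens and dialogue turns are carried forward as context. The API names (`encode_segment`, `answer`) are hypothetical placeholders, not StreamBridge's actual interface.

```python
from collections import deque

# Minimal sketch of a multi-turn streaming loop (hypothetical API, not StreamBridge's).
class StreamingSession:
    def __init__(self, model, max_turns=32):
        self.model = model
        # Rolling history of (visual_tokens, query, reply) tuples; oldest dropped first.
        self.history = deque(maxlen=max_turns)

    def step(self, video_segment, query):
        # Encode only the newest segment; earlier segments live on in the history.
        visual_tokens = self.model.encode_segment(video_segment)
        # The model must answer from the new segment plus the retained context.
        reply = self.model.answer(context=list(self.history),
                                  visual=visual_tokens, query=query)
        self.history.append((visual_tokens, query, reply))
        return reply
```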

Video-LLMs have drawn significant attention for video understanding, combining visual encoders, modality projectors, and LLMs to generate contextual responses from video content. Several approaches have emerged to tackle streaming video understanding. VideoLLM-online and Flash-VStream introduced specialized online objectives and memory architectures to handle sequential inputs. MMDuet and ViSpeak developed dedicated components for proactive response generation. Several benchmark suites have been used to assess streaming capabilities, including StreamingBench, StreamBench, SVBench, OmniMMI, and OVO-Bench.

Researchers from Apple and Fudan University have proposed StreamBridge, a framework that transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: the limited capacity for multi-turn real-time understanding and the lack of proactive response mechanisms. StreamBridge combines a memory buffer with a round-decayed compression strategy to support long-context interactions. It also incorporates a decoupled, lightweight activation model that integrates seamlessly with existing Video-LLMs for proactive response generation. In addition, the researchers introduced Stream-IT, a large-scale dataset designed for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats.
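The article does not detail the compression mechanics, but a round-decayed strategy can be illustrated roughly as below: when the memory buffer exceeds its token budget, visual tokens from the oldest dialogue rounds are compressed first, so the most recent rounds retain full fidelity. This is a hedged sketch with hypothetical names (`round_decayed_compress`, `compress_tokens`), not the paper's implementation.

```python
def round_decayed_compress(rounds, budget, compress_tokens):
    """Shrink the memory buffer by compressing the oldest rounds first.

    rounds: per-round lists of visual tokens, oldest first (hypothetical layout).
    budget: maximum total number of tokens to keep.
    compress_tokens: callable that downsamples one round's tokens.
    """
    total = sum(len(r) for r in rounds)
    i = 0
    while total > budget and i < len(rounds):
        compressed = compress_tokens(rounds[i])
        total -= len(rounds[i]) - len(compressed)
        rounds[i] = compressed
        i += 1  # decay proceeds from the oldest round toward the newest
    return rounds

# Toy usage: halve each round via striding until the buffer fits 300 tokens.
rounds = [list(range(100)) for _ in range(4)]
rounds = round_decayed_compress(rounds, budget=300, compress_tokens=lambda t: t[::2])
print([len(r) for r in rounds])  # -> [50, 50, 100, 100]
```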

The StreamBridge framework is evaluated on three offline Video-LLMs: LLaVA-OV-7B, Qwen2-VL-7B, and Oryx-1.5-7B. The Stream-IT dataset is supplemented with approximately 600,000 samples from established datasets to maintain general video understanding capability, including LLaVA-178K, VCG-Plus, and ShareGPT4Video. OVO-Bench and StreamingBench are used for multi-turn real-time understanding, focusing on their real-time tasks. General video understanding is evaluated across seven benchmarks, including three short-video datasets (MVBench, PerceptionTest, TempCompass) and four long-video benchmarks (EgoSchema, LongVideoBench, MLVU, Video-MME).

The evaluation results show that Qwen2-VL improved, with average scores rising from 55.98 to 63.35 on OVO-Bench and from 69.04 to 72.01 on StreamingBench. In contrast, LLaVA-OV sees slight performance drops, from 64.02 to 61.64 on OVO-Bench and from 71.12 to 68.39 on StreamingBench. Fine-tuning on the Stream-IT dataset yields substantial improvements across all models: Oryx-1.5 achieves gains of +11.92 on OVO-Bench and +4.2 on StreamingBench. Moreover, Qwen2-VL reaches average scores of 71.30 on OVO-Bench and 77.04 on StreamingBench after Stream-IT fine-tuning, even outperforming proprietary models such as GPT-4o and Gemini 1.5 Pro, demonstrating the effectiveness of the StreamBridge approach in improving streaming video understanding.

In conclusion, the researchers introduced StreamBridge, a method for transforming offline Video-LLMs into effective streaming-capable models. Its dual innovations, a memory buffer with round-decayed compression and a decoupled, lightweight activation model, address the core challenges of streaming video understanding without compromising general performance. In addition, the Stream-IT dataset is introduced for streaming video understanding, featuring specialized interleaved video-text sequences. As streaming video understanding becomes increasingly essential in robotics and autonomous driving, StreamBridge offers a generalizable solution that turns static Video-LLMs into dynamic, responsive systems capable of meaningful interaction in continuously evolving visual environments.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90K+ ML SubReddit.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
