Researchers Introduce MMLongBench: A Comprehensive Benchmark for Long-Context Vision-Language Models

by Brenden Burgess


Recent progress in long-context (LC) modeling has unlocked new capabilities for large language models (LLMs) and large vision-language models (LVLMs). Long-context vision-language models (LCVLMs) represent an important step forward by enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks has lagged behind. It remains unclear how well current LCVLMs perform in long-context settings, which tasks they struggle with, and how robust they are to variation in input length. Current benchmarks suffer from the following problems: (a) limited coverage of downstream tasks, (b) insufficient coverage of image types, (c) lack of control over context length, and (d) evaluation at only a single context length.

Various techniques have extended context windows for LVLMs, including longer pre-training lengths, position extrapolation, and efficient architectures. Models like Gemini-2.5 and Qwen2.5-VL have adopted these approaches alongside vision token compression methods to accommodate longer sequences. For evaluation, the needle-in-a-haystack (NIAH) task has become a standard benchmark for testing LC capability by inserting information at specific depths within long texts. However, existing vision-language benchmarks remain limited, focusing solely on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, failing to evaluate true LC capabilities across diverse vision-language applications.
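To make the NIAH setup concrete, below is a minimal, text-only sketch of how such a probe is typically constructed; the `build_niah_prompt` helper, the filler text, and the model call are illustrative assumptions, not part of MMLongBench itself.

```python
# Minimal needle-in-a-haystack (NIAH) probe: insert a "needle" fact at a chosen
# relative depth inside long filler text, then ask the model to retrieve it.
def build_niah_prompt(haystack: str, needle: str, depth: float) -> str:
    """depth is the relative insertion point: 0.0 = start, 1.0 = end."""
    cut = int(len(haystack) * depth)
    context = haystack[:cut] + " " + needle + " " + haystack[cut:]
    question = "What is the secret number mentioned in the text above?"
    return f"{context}\n\n{question}"

needle = "The secret number is 7481."
filler = "The quick brown fox jumps over the lazy dog. " * 5000  # long distractor text

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_niah_prompt(filler, needle, depth)
    # answer = model.generate(prompt)   # hypothetical model call
    # success = "7481" in answer        # score retrieval at this depth
```

Vision-language NIAH variants follow the same idea, but hide the needle inside an image or an interleaved image-text sequence rather than in plain text.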

Researchers from HKUST, Tencent AI Seattle Lab, the University of Edinburgh, Miniml.AI, and the NVIDIA AI Technology Center have proposed MMLongBench, the first comprehensive benchmark for evaluating LCVLMs. It comprises 13,331 examples spanning five categories of downstream tasks, including visual RAG and many-shot ICL, and covers both natural and synthetic image types. All examples are standardized across five input lengths from 8K to 128K tokens using a cross-modal tokenization scheme that combines vision patches and text tokens. Through benchmarking 46 closed-source and open-source models, the research reveals that single-task performance does not predict overall LC capability, that both types of models struggle with LC tasks, and that stronger reasoning models show better LC performance.
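The benchmark's overall grid can be summarized in a small configuration sketch. Note that the five category names below are pieced together from tasks mentioned elsewhere in this article (visual RAG, many-shot ICL, NIAH, long-document VQA, summarization), and the intermediate context lengths are assumed to double from 8K to 128K; both are assumptions rather than the official specification.

```python
# Illustrative summary of MMLongBench's evaluation grid (category names and
# exact token budgets are assumptions, as noted above).
MMLONGBENCH_SPEC = {
    "num_examples": 13_331,
    "task_categories": [
        "visual_rag",            # retrieval-augmented VQA over Wikipedia passages
        "many_shot_icl",         # image classification with many in-context examples
        "needle_in_a_haystack",  # retrieval of planted facts at varying depths
        "long_document_vqa",     # questions over long multi-page documents
        "summarization",         # summarizing long interleaved inputs
    ],
    "context_lengths_tokens": [8_192, 16_384, 32_768, 65_536, 131_072],
    "num_models_evaluated": 46,
}
```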

Researchers construct LC inputs by inserting gold passages that contain the answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages from KILT are used, while InfoSeek uses the lead sections of Wikipedia entity pages. Wikipedia pages are split into 100-word passages, and retrieved distractors are added until the desired input lengths are reached. Many-shot in-context learning tasks use four diverse image classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, accommodating 500 images within 128K context windows. Cross-modal token counting combines text tokens, counted with the Llama2 tokenizer, with visual tokens processed through 14×14 patches and 2×2 pixel unshuffle compression, ensuring compatibility with modern LVLMs for evaluation.
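A rough sketch of this cross-modal token counting is shown below, assuming a Hugging Face Llama-2 tokenizer; the helper names and example image sizes are our own, and only the 14×14 patch size and 2×2 pixel-unshuffle merge follow the description above.

```python
from transformers import AutoTokenizer

PATCH_SIZE = 14   # ViT patch size in pixels (per the article)
MERGE = 2         # 2x2 pixel unshuffle: 4 neighboring patches -> 1 visual token

def count_visual_tokens(width_px: int, height_px: int) -> int:
    """Approximate visual token count for one image after patching + unshuffle."""
    patches_w = width_px // PATCH_SIZE
    patches_h = height_px // PATCH_SIZE
    return (patches_w // MERGE) * (patches_h // MERGE)

def count_cross_modal_tokens(text: str, image_sizes: list[tuple[int, int]], tokenizer) -> int:
    """Total length = Llama2 text tokens + visual tokens over all images."""
    text_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    visual_tokens = sum(count_visual_tokens(w, h) for w, h in image_sizes)
    return text_tokens + visual_tokens

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
length = count_cross_modal_tokens("Question: ...", [(448, 448), (896, 672)], tokenizer)
# Distractor passages (100-word Wikipedia chunks) are appended until `length`
# reaches the target context size, e.g. 8K up to 128K tokens.
```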

The MMLongBench evaluation across tasks and context lengths shows that all models struggle, but closed-source models perform better. At the longest input length of 128K, all models struggle with long-context vision-language tasks, with GPT-4o reaching only 62.9 average performance. Gemini-2.5-Pro emerges as the strongest performer, surpassing open-source models by 20 points, except on ICL tasks. Moreover, the Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o (42.4). Qwen2.5-VL-32B achieves a SubEM score of 64.6 on visual RAG, even better than Gemini-2.0-Flash. Models also show generalization beyond their training context length, with Qwen2-VL-72B reaching an average score of 51.9 at 128K despite a 32K training window.

In conclusion, the researchers introduced MMLongBench, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. It provides a rigorous foundation for diagnosing the capabilities of frontier models by covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models demonstrates that single-task performance is an unreliable predictor of overall long-context capability, and that frontier models face significant challenges in OCR accuracy and cross-modal retrieval. MMLongBench serves as a standard evaluation framework to drive future research toward more efficient vision-language token encodings, robust position extrapolation, and improved multimodal retrieval and reasoning capabilities.


Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
