VLM2Vec-V2: A Unified Computer Vision Framework for Learning Multimodal Embeddings Across Images, Videos, and Visual Documents

by Brenden Burgess


Embedding models act as bridges between different data modalities by encoding information from multiple modalities into a shared dense representation space. Progress in embedding models in recent years has been driven by advances in large foundation models. However, existing multimodal embedding models are trained on datasets such as MMEB and M-BEIR, which consist mostly of natural images and photographs drawn from the MSCOCO, Flickr, and ImageNet datasets. These datasets do not cover other important forms of visual information, including documents, PDFs, websites, videos, and slides. As a result, existing embedding models underperform on realistic tasks such as article search, website search, and YouTube video search.
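To make the idea of a shared embedding space concrete, the minimal sketch below encodes an image and two captions and compares them with cosine similarity. CLIP is used here only as a stand-in encoder since its API is widely available; VLM2Vec-V2 plays the same role but also covers videos and visual documents.

```python
# Minimal sketch of a shared text-image embedding space, using CLIP as a
# stand-in encoder (not the VLM2Vec-V2 API).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder image; use a real file in practice
texts = ["a chart from a financial report", "a cat sleeping on a sofa"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# L2-normalize so that the dot product equals cosine similarity in the shared space.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)  # higher score = closer in the shared embedding space
```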

Multimodal embedding benchmarks such as MSCOCO, Flickr30K, and Conceptual Captions initially focused on static image-text pairs for tasks such as image captioning and retrieval. More recent benchmarks, such as M-BEIR and MMEB, introduced multi-task evaluations but remain limited to static images and short contexts. Video representation learning has evolved through models such as VideoCLIP and VideoCoCa, which combine contrastive learning with captioning objectives. Visual document representation learning has advanced through models such as ColPali and VisRAG, which use VLMs for document retrieval. Unified-modality retrieval methods such as GME and Uni-Retrieval achieve strong performance on universal benchmarks. However, none of them unifies image, video, and visual document retrieval in a single framework.

Researchers from Salesforce Research, UC Santa Barbara, the University of Waterloo, and Tsinghua University have proposed VLM2Vec-V2 to unify image, video, and visual document retrieval in a single framework. First, the researchers developed MMEB-V2, a benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. Second, VLM2Vec-V2 serves as a general-purpose embedding model that supports multiple input modalities while demonstrating strong performance on both the newly introduced tasks and the original image benchmarks. This establishes a foundation for more scalable and flexible representation learning in both research and practical applications.


VLM2Vec-V2 uses Qwen2-VL as its backbone, selected for its strong multimodal processing capabilities. Qwen2-VL offers three features that are critical for unified embedding learning: naive dynamic resolution, multimodal rotary position embedding (M-RoPE), and a unified framework combining 2D and 3D convolutions. To enable effective multi-task training across diverse data sources, VLM2Vec-V2 introduces a flexible data sampling pipeline with two key components: (a) on-the-fly batch mixing based on predefined sampling weight tables that control the relative sampling probability of each dataset, and (b) an interleaved sub-batching strategy that splits full batches into independently sampled sub-batches, improving the stability of contrastive learning.
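The sketch below illustrates these two sampling ideas in simplified form; the dataset names, weight table, and helper functions are illustrative assumptions rather than the authors' code. Each sub-batch is drawn from a single dataset chosen according to the weight table, and a full batch is assembled from several independently sampled sub-batches.

```python
# Hedged sketch of weighted batch mixing with interleaved sub-batches.
# Names and weights are hypothetical, not taken from the paper's codebase.
import random

SAMPLING_WEIGHTS = {            # hypothetical sampling weight table
    "image_retrieval": 0.4,
    "visual_documents": 0.3,
    "video_retrieval": 0.3,
}

def sample_dataset(weights):
    """Pick one dataset name according to its relative sampling weight."""
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

def build_batch(datasets, batch_size=64, num_sub_batches=4):
    """Assemble a full batch from independently sampled, homogeneous sub-batches."""
    sub_size = batch_size // num_sub_batches
    batch = []
    for _ in range(num_sub_batches):
        name = sample_dataset(SAMPLING_WEIGHTS)
        batch.append((name, random.sample(datasets[name], sub_size)))
    return batch  # each entry: (dataset name, list of examples from that dataset)

# Toy usage: datasets are just lists of example IDs here.
datasets = {name: [f"{name}_{i}" for i in range(1000)] for name in SAMPLING_WEIGHTS}
print([name for name, _ in build_batch(datasets)])
```

Keeping each sub-batch homogeneous means in-batch negatives within a sub-batch come from the same task and modality, which is what makes the contrastive objective more stable while the batch-level mixture still covers all data sources.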

VLM2Vec-V2 achieves the highest average score of 58.0 across 78 datasets spanning images, videos, and visual documents, outperforming strong baselines including GME, LamRA, and VLM2Vec built on the same Qwen2-VL backbone. On image tasks, VLM2Vec-V2 surpasses most baselines by significant margins and achieves performance comparable to VLM2Vec-7B despite having only 2B parameters. On video tasks, the model achieves competitive performance despite being trained on relatively small amounts of video data. In visual document retrieval, VLM2Vec-V2 outperforms all VLM2Vec variants but still lags behind ColPali, which is specifically optimized for visual document tasks.

In conclusion, the researchers introduced VLM2Vec-V2, a strong baseline model trained with contrastive learning across diverse tasks and modality combinations. VLM2Vec-V2 is trained on MMEB-V2 and uses Qwen2-VL as its backbone model. MMEB-V2 is a benchmark designed by the researchers to evaluate multimodal embedding models across diverse modalities, including text, images, videos, and visual documents. The experimental evaluation demonstrates the effectiveness of VLM2Vec-V2 in achieving balanced performance across multiple modalities, while highlighting the diagnostic value of MMEB-V2 for future research.


Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.



Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
