ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

by Brenden Burgess


Video captioning models are generally trained on datasets of short videos, typically under three minutes long, paired with corresponding captions. While this lets them describe basic actions such as walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and films that can run for over an hour. When applied to such videos, they often generate fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to a shortage of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, but this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-video datasets are widely available and easier to work with.

Advances in vision-language models have considerably improved the integration of vision and language tasks, with early works such as CLIP and ALIGN laying the foundations. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them for video understanding by focusing on temporal sequence modeling and building more robust datasets. Despite these developments, the scarcity of annotated long-form video datasets remains a significant obstacle to progress. Traditional short-video tasks, such as video question answering, captioning, and grounding, mainly require spatial or temporal understanding, whereas summarizing hour-long videos requires identifying key frames amid substantial redundancy. Although some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization tasks due to data limitations.

Researchers from Queen Mary University and Spotify introduce ViSMaP, an unsupervised method for summarizing hour-long videos without requiring expensive annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where important events are scattered throughout. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to generate and refine pseudo-summaries from clip descriptions created by short-video models. The process involves three LLMs working in sequence for generation, evaluation, and prompt optimization. ViSMaP achieves performance comparable to fully supervised models across several datasets while maintaining domain adaptability and eliminating the need for extensive manual labeling.
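To make the generator–evaluator–optimizer idea concrete, here is a minimal sketch of such a meta-prompting loop in Python. The `chat` callable, the prompt wording, and the fixed number of rounds are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Callable


def meta_prompt_summary(
    clip_captions: list[str],
    chat: Callable[[str], str],
    rounds: int = 3,
) -> str:
    """Iteratively refine a pseudo-summary of a long video from its clip captions."""
    captions_text = "\n".join(f"- {c}" for c in clip_captions)
    # The instruction (prompt) is what the optimizer LLM rewrites each round.
    instruction = "Summarize the hour-long video described by these clip captions."
    summary = ""

    for _ in range(rounds):
        # Generator LLM: produce a candidate summary under the current instruction.
        summary = chat(f"{instruction}\n\nClip captions:\n{captions_text}")

        # Evaluator LLM: critique the candidate for coverage and coherence.
        feedback = chat(
            "Assess this summary of the clip captions for coverage and coherence, "
            f"and list its main weaknesses.\n\nSummary:\n{summary}"
        )

        # Optimizer LLM: rewrite the instruction so the next summary fixes the weaknesses.
        instruction = chat(
            "Rewrite the following summarization instruction so the next summary "
            f"addresses the listed weaknesses.\n\nInstruction:\n{instruction}\n\nFeedback:\n{feedback}"
        )

    return summary


if __name__ == "__main__":
    # Stub LLM so the sketch runs end to end; swap in a real chat-model API call.
    dummy_chat = lambda prompt: "stubbed LLM response"
    print(meta_prompt_summary(["a person unpacks groceries", "they cook pasta"], chat=dummy_chat))
```

In practice, the three roles would be served by separate LLM calls (or models), and the loop would stop once the evaluator's score plateaus rather than after a fixed number of rounds.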

The study tackles cross-domain video summarization: training on a labeled short-video dataset and adapting to unlabeled, hour-long videos from a different domain. First, a model is trained to summarize 3-minute videos using TimeSformer visual features, a visual-language alignment module, and a text decoder, optimized with cross-entropy and contrastive losses. To handle longer videos, they are segmented into 3-minute clips, and pseudo-captions are generated for each clip. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) then refines these into pseudo-summaries. Finally, the model is fine-tuned on these pseudo-summaries using a symmetric cross-entropy (SCE) loss to handle noisy labels and improve adaptation.
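As a concrete reference for the noisy-label term, below is a minimal PyTorch sketch of a symmetric cross-entropy loss of the kind described above. The weights `alpha`, `beta`, the clipping constant, and the token-level framing are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F


def symmetric_cross_entropy(
    logits: torch.Tensor,     # (N, vocab_size) unnormalized token scores
    targets: torch.Tensor,    # (N,) pseudo-summary token ids
    alpha: float = 1.0,       # weight of the standard CE term (illustrative value)
    beta: float = 1.0,        # weight of the reverse CE term (illustrative value)
    log_clip: float = -4.0,   # finite stand-in for log(0) in the reverse term
) -> torch.Tensor:
    """Symmetric cross-entropy: standard CE plus a reverse CE term that
    reduces the influence of noisy (pseudo-)labels."""
    # Standard cross-entropy: the pseudo-labels supervise the predictions.
    ce = F.cross_entropy(logits, targets)

    # Reverse cross-entropy: the predictions "supervise" the one-hot pseudo-labels.
    pred = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    # log of a one-hot vector is -inf off the target class; clamp it to a constant.
    log_one_hot = torch.clamp(torch.log(one_hot + 1e-12), min=log_clip)
    rce = -(pred * log_one_hot).sum(dim=-1).mean()

    return alpha * ce + beta * rce


# Example: 4 token positions over a 10-word vocabulary.
logits = torch.randn(4, 10)
targets = torch.tensor([1, 3, 5, 7])
loss = symmetric_cross_entropy(logits, targets)
```

The reverse term penalizes the model less when a confidently wrong pseudo-label disagrees with its prediction, which is what makes the loss more robust to label noise than plain cross-entropy.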

The study evaluates ViSMaP across three scenarios: summarization of hour-long videos on Ego4D-HCap, cross-domain generalization on the MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos on EgoSchema. ViSMaP, trained on hour-long videos, is compared against supervised and zero-shot methods, such as Video ReCap and LaViLa+GPT-3.5, demonstrating competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, and METEOR scores, as well as QA accuracy. Ablation studies highlight the benefits of the meta-prompting modules and of components such as contrastive learning and the SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training on an NVIDIA A100 GPU.
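For readers who want to reproduce this kind of evaluation, the snippet below shows one common way to compute CIDEr, ROUGE-L, and METEOR for generated summaries against references using the pycocoevalcap package; the example data is made up, and this is a generic scoring recipe rather than the authors' evaluation code.

```python
# Requires `pip install pycocoevalcap` (METEOR additionally needs a Java runtime).
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map a video id to a list of sentences; the data here is illustrative.
references = {"video_001": ["a person cooks dinner and then cleans the kitchen"]}
candidates = {"video_001": ["someone prepares a meal and tidies up afterwards"]}

for name, scorer in [("CIDEr", Cider()), ("ROUGE-L", Rouge()), ("METEOR", Meteor())]:
    score, _ = scorer.compute_score(references, candidates)
    print(f"{name}: {score:.3f}")
```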

In conclusion, ViSMaP is an unsupervised approach to summarizing long videos that relies on annotated short-video datasets and a meta-prompting strategy. It first creates high-quality pseudo-summaries through meta-prompting, then uses them to train a summarization model, reducing the need for extensive annotation. Experimental results show that ViSMaP performs on par with fully supervised methods and adapts effectively across diverse video datasets. However, its reliance on pseudo-labels from a source-domain model can affect performance under significant domain shifts. In addition, ViSMaP currently relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.


Check out the Paper.



Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

