Multimodal foundation models (MFMs) such as GPT-4o, Gemini, and Claude have shown rapid progress recently, especially in public demos. While their linguistic skills are well studied, their true ability to understand visual information remains unclear. Most benchmarks in use today focus heavily on text-centric tasks, such as VQA or classification, which often reflect linguistic strengths more than visual capabilities. These tests also require text outputs, making it difficult to assess visual skills fairly or to compare MFMs with vision-specific models. Moreover, critical aspects such as 3D perception, segmentation, and grouping, which are essential for visual understanding, remain largely ignored in current evaluations.
MFMs have shown strong performance on tasks that combine visual and language understanding, such as captioning and visual question answering. However, their effectiveness on tasks requiring detailed visual understanding remains uncertain. Most current benchmarks rely on text-based outputs, which makes it hard to compare MFMs fairly with vision-only models. Some studies attempt to adapt vision datasets for MFMs by converting annotations into text, but this restricts evaluation to linguistic outputs. Prompting strategies have also been explored to help MFMs tackle visual tasks by breaking them into manageable subtasks, although reproducibility remains a challenge in some cases.
Researchers at EPFL evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Since most MFMs are designed to produce text and are accessible only through APIs, the team developed a prompt-chaining framework to translate these visual tasks into text-compatible formats. Their results show that while MFMs are competent generalists, they fall short of specialized vision models, especially on geometric tasks. GPT-4o stood out, performing best on 4 of the 6 tasks. The evaluation toolkit will be open-sourced.
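To make the API-only constraint concrete, here is a minimal sketch (not the authors' released toolkit) of how a vision task like ImageNet classification can be phrased as a text prompt to a chat-completion API. It assumes the standard `openai` Python client; the prompt wording and model name are illustrative.

```python
# Minimal sketch: classification via a text-only chat API (illustrative prompt).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_image(image_path: str, candidate_labels: list[str]) -> str:
    """Ask the model to pick exactly one label from a fixed candidate set."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "Which of the following labels best describes the main object in the image? "
        f"Answer with exactly one label: {', '.join(candidate_labels)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```

Because the answer is constrained to a fixed label set, the text output can be scored directly against the dataset's ground truth, which is what allows a text-only model to be compared with vision-specific classifiers.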
To assess MFMs on vision tasks, the study designed a prompt-chaining strategy that breaks complex tasks into simpler, text-friendly subtasks. For example, instead of predicting bounding boxes directly, the model first identifies the objects present and then localizes them through recursive cropping of the image, as sketched below. For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Depth and surface normals are estimated via pairwise rankings of superpixel regions. This modular design leverages MFMs' strengths in classification and similarity judgments, while calibration controls ensure fair comparisons. The method is flexible, and performance improves with finer-grained prompting.
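The sketch below illustrates the recursive-cropping idea for localization under simple assumptions. The helper `model_says_object_present` is hypothetical and stands in for a yes/no prompt to the MFM; the paper's exact prompts, split strategy, and stopping rule may differ.

```python
# Sketch of recursive-cropping localization: repeatedly split the current
# region and keep the half where the model still reports the object.
from PIL import Image

def model_says_object_present(crop: Image.Image, label: str) -> bool:
    """Placeholder: send `crop` to the MFM and ask whether `label` is visible."""
    raise NotImplementedError  # wire this up to your API client of choice

def localize(image: Image.Image, label: str, min_size: int = 64) -> tuple[int, int, int, int]:
    """Narrow a bounding box for `label` by halving the region along its longer side."""
    box = (0, 0, image.width, image.height)
    while True:
        left, top, right, bottom = box
        w, h = right - left, bottom - top
        if max(w, h) <= min_size:
            return box
        # Split the longer side into two halves.
        if w >= h:
            halves = [(left, top, left + w // 2, bottom),
                      (left + w // 2, top, right, bottom)]
        else:
            halves = [(left, top, right, top + h // 2),
                      (left, top + h // 2, right, bottom)]
        keep = [b for b in halves if model_says_object_present(image.crop(b), label)]
        if len(keep) != 1:
            return box  # object spans both halves (or was lost): stop refining
        box = keep[0]
```

Each refinement only asks the model a binary question about a crop, which plays to MFMs' strengths in recognition rather than requiring them to emit precise coordinates.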
The study evaluates several MFMs, notably GPT-4o, Gemini Flash, and Claude 3.5, on multiple tasks, including image classification, object detection, and segmentation. Using datasets such as ImageNet, COCO, and Hypersim, the results show GPT-4o reaching 77.2% on ImageNet and 60.62 AP50 on object detection, outperformed by specialized models such as ViT-G (90.94%) and Co-DETR (91.30%). On semantic segmentation, GPT-4o scores 44.89 mIoU, while OneFormer leads with 65.52. MFMs handle distribution shifts fairly well but lag behind on precise visual reasoning. The study also uses prompt chaining and oracle baselines to estimate upper-bound performance.
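For reference, the mIoU numbers above follow the standard mean intersection-over-union definition. Below is a minimal NumPy sketch of that metric; the paper's evaluation code may differ in details such as ignore labels or class averaging.

```python
# Standard mean-IoU over integer class maps of the same shape.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```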
In conclusion, the study introduces a benchmarking framework for assessing the visual capabilities of MFMs such as GPT-4o, Gemini, and Claude by converting standard vision tasks into prompt-based formats. The results show that MFMs perform better on semantic tasks than on geometric ones, with GPT-4o leading overall. However, all MFMs lag considerably behind specialized vision models. Although they are generalists trained mainly on image-text data, they show promising progress, with more recent reasoning models such as o3 improving on 3D tasks in particular. Limitations include high inference cost and prompt sensitivity. Still, the framework provides a unified way to assess the visual understanding of MFMs, laying the groundwork for future progress.
Check out the Paper, GitHub Page, and Project. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, Sana brings a fresh perspective to the intersection of AI and real-life solutions.