TL;DR
This paper benchmarks multimodal foundation models like GPT-4o on standard computer vision tasks, revealing their strengths and limitations compared to specialized models, and introduces a standardized evaluation framework.
Contribution
The paper develops a prompt-based benchmarking framework for evaluating proprietary and open multimodal foundation models (MFMs) on diverse vision tasks, highlighting their generalist capabilities and shortcomings.
Findings
MFMs lag behind specialized models on all tasks.
GPT-4o is the top performer among non-reasoning models.
MFMs perform better on semantic tasks than on geometric tasks.
Abstract
Multimodal foundation models (MFMs), such as GPT-4o, have recently made remarkable progress. However, their detailed visual understanding beyond question answering remains unclear. In this paper, we benchmark popular MFMs (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet). The main challenges in performing this analysis are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these by translating vision tasks into text-promptable, API-compatible…
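To make the idea of a text-promptable, API-compatible task concrete, here is a minimal sketch of one way coarse object localization could be reduced to a chain of yes/no questions. It is an illustration under stated assumptions, not the paper's method: the coarse-to-fine quadrant scheme, the names `localize`, `split_quadrants`, and `bounding_box`, and the `contains_object` callable are all hypothetical. In practice `contains_object` would crop the image and ask a model through its API whether the object is visible; here a geometric stub stands in for the model so the example runs offline.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def split_quadrants(box: Box) -> List[Box]:
    """Split a box into four quadrants at its midpoint."""
    l, t, r, b = box
    mx, my = (l + r) // 2, (t + b) // 2
    return [(l, t, mx, my), (mx, t, r, my), (l, my, mx, b), (mx, my, r, b)]


def localize(contains_object: Callable[[Box], bool],
             image_box: Box,
             min_size: int = 32) -> List[Box]:
    """Return the smallest cells for which the yes/no oracle answers True.

    Each oracle call is one text-only question about one crop, so the
    whole procedure needs nothing beyond a prompt-in, text-out interface.
    """
    hits: List[Box] = []
    stack = [image_box]
    while stack:
        box = stack.pop()
        if not contains_object(box):
            continue  # prune regions the oracle rejects
        l, t, r, b = box
        if (r - l) <= min_size or (b - t) <= min_size:
            hits.append(box)  # cell is small enough; stop refining
        else:
            stack.extend(split_quadrants(box))
    return hits


def bounding_box(cells: List[Box]) -> Box:
    """Tightest box enclosing all accepted cells."""
    return (min(c[0] for c in cells), min(c[1] for c in cells),
            max(c[2] for c in cells), max(c[3] for c in cells))


# Stand-in for the model (assumption): answers "yes" when the crop
# overlaps a known ground-truth box. A real oracle would query an MFM.
TRUE_BOX: Box = (100, 60, 140, 90)


def stub_oracle(box: Box) -> bool:
    l, t, r, b = box
    ol, ot, o_r, ob = TRUE_BOX
    return l < o_r and r > ol and t < ob and b > ot


found = localize(stub_oracle, (0, 0, 256, 256), min_size=32)
estimate = bounding_box(found)  # coarse box that encloses TRUE_BOX
```

The estimate is only as fine as `min_size`, and the number of oracle calls grows as the cells shrink, so a scheme like this trades localization accuracy against API cost.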
