How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Rahul Ramachandran; Ali Garjani; Roman Bachmann; Andrei Atanov; Oğuzhan Fatih Kar; Amir Zamir

arXiv:2507.01955 · cs.CV · May 4, 2026

TL;DR

This paper benchmarks multimodal foundation models like GPT-4o on standard computer vision tasks, revealing their strengths and limitations compared to specialized models, and introduces a standardized evaluation framework.

Contribution

The paper develops a prompt-based benchmarking framework for evaluating proprietary and open MFMs on diverse vision tasks, highlighting both their generalist capabilities and their shortcomings.

Findings

01 MFMs lag behind specialized models on all tasks.

02 GPT-4o is the top performer among non-reasoning models.

03 MFMs perform better on semantic tasks than on geometric tasks.

Abstract

Multimodal foundation models (MFMs), such as GPT-4o, have recently made remarkable progress. However, their detailed visual understanding beyond question answering remains unclear. In this paper, we benchmark popular MFMs (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO and ImageNet). The main challenges in performing this analysis are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these by translating vision tasks into text-promptable, API-compatible…
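To make the idea of a text-promptable, API-compatible vision task concrete, here is a minimal sketch of one way a localization task could be reduced to a chain of yes/no text questions about image crops. It is a hypothetical illustration, not the authors' implementation: the names `split_into_quadrants`, `localize`, and `make_fake_model`, the quadrant-splitting strategy, and the prompt wording are all assumptions, and the fake model stands in for a real MFM API call that would receive the cropped image.

```python
from typing import Callable, List, Tuple

# (x0, y0, x1, y1) in pixels, half-open on the right and bottom edges.
Box = Tuple[int, int, int, int]


def split_into_quadrants(box: Box) -> List[Box]:
    """Split a box into four equal quadrants."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]


def localize(ask: Callable[[str, Box], str], label: str,
             image_box: Box, min_size: int = 8) -> Box:
    """Narrow down a region containing `label` using only text answers.

    `ask(prompt, crop)` is a hypothetical wrapper around a model API:
    it would send the prompt plus the cropped image and return text.
    """
    box = image_box
    while box[2] - box[0] > min_size and box[3] - box[1] > min_size:
        prompt = f"Does this crop contain a {label}? Answer yes or no."
        hits = [q for q in split_into_quadrants(box)
                if ask(prompt, q).strip().lower().startswith("yes")]
        if len(hits) != 1:
            break  # object spans several quadrants (or none): stop refining
        box = hits[0]
    return box


def make_fake_model(target: Box) -> Callable[[str, Box], str]:
    """Stand-in for a real model: says yes when the crop overlaps `target`."""
    def ask(prompt: str, crop: Box) -> str:
        x0, y0, x1, y1 = crop
        tx0, ty0, tx1, ty1 = target
        overlaps = x0 < tx1 and tx0 < x1 and y0 < ty1 and ty0 < y1
        return "Yes" if overlaps else "No"
    return ask


if __name__ == "__main__":
    model = make_fake_model(target=(40, 40, 44, 44))
    print(localize(model, "cat", image_box=(0, 0, 64, 64)))
```

The point of the sketch is that the model only ever returns text, yet the sequence of answers yields a spatial output. A real pipeline would also need to handle ambiguous or malformed answers and objects that straddle crop boundaries, which this example sidesteps by simply stopping.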

Code & Models

Repositories

epfl-vilab/fm-vision-evals (GitHub)

Videos

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks (SlidesLive)