Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

Aditya Kanade; Tanuja Ganu

arXiv:2506.02022 · cs.CV · December 11, 2025

TL;DR

This paper introduces a comprehensive benchmark to evaluate visual perception in multimodal large language models, revealing significant performance gaps and identifying key challenges like attention misallocation and representation instability.

Contribution

It presents a new scalable benchmark with diverse subtasks to systematically assess and analyze visual perception in MLLMs, highlighting critical weaknesses and areas for improvement.

Findings

01

Humans achieve 96.49% accuracy on the benchmark.

02

Top MLLMs average below 50% accuracy, and the gap to human performance widens as task complexity increases.

03

Failures are linked to attention misallocation and unstable internal representations.

Abstract

Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens…
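The headline numbers above are accuracies, and the benchmark's controllable complexity suggests reporting them per subtask and per difficulty level. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The record fields (`subtask`, `level`, `prediction`, `answer`), the subtask names, and the exact-match scoring rule are all assumptions made for illustration, and the toy records are made up rather than drawn from the benchmark.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct answers} for records grouped by `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        # Assumed scoring rule: case-insensitive exact match after trimming.
        # A real harness may need task-specific answer matching.
        predicted = str(record["prediction"]).strip().lower()
        expected = str(record["answer"]).strip().lower()
        if predicted == expected:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical toy records (not benchmark data); field names are assumptions.
records = [
    {"subtask": "subtask_a", "level": 1, "prediction": "3", "answer": "3"},
    {"subtask": "subtask_a", "level": 2, "prediction": "4", "answer": "5"},
    {"subtask": "subtask_b", "level": 1, "prediction": " Yes", "answer": "yes"},
    {"subtask": "subtask_b", "level": 2, "prediction": "no", "answer": "yes"},
]

per_subtask = accuracy_by_group(records, "subtask")
per_level = accuracy_by_group(records, "level")
print(per_subtask)
print(per_level)
```

Grouping by `level` rather than `subtask` is what would expose a drop in accuracy as complexity increases, since an overall average can hide it.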

Code & Models

Repositories

microsoft/do-you-see-me
Official

Datasets

microsoft/Do-You-See-Me
dataset · 602 downloads

Videos

Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

Taxonomy

Topics: Color perception and design · Safety Warnings and Signage · Data Visualization and Analytics