Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs
Aditya Kanade, Tanuja Ganu

TL;DR
This paper introduces a comprehensive benchmark to evaluate visual perception in multimodal large language models, revealing significant performance gaps and identifying key challenges like attention misallocation and representation instability.
Contribution
It presents a new scalable benchmark with diverse subtasks to systematically assess and analyze visual perception in MLLMs, highlighting critical weaknesses and areas for improvement.
Findings
Humans achieve 96.49% accuracy on the benchmark.
Top MLLMs average below 50% accuracy, and performance drops further as task complexity increases.
Failures are linked to attention misallocation and unstable internal representations.
Abstract
Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens…
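The comparisons above rest on accuracy broken down by subtask and by complexity level. A minimal sketch of that bookkeeping follows. The record fields (`subtask`, `complexity`, `prediction`, `answer`), the function name `accuracy_by_group`, and the toy records are all assumptions made for illustration. They are not the benchmark's actual schema, evaluation code, or data.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group value: fraction of exact-match correct answers}.

    `records` is a list of dicts; `key` names the field to group by.
    Field names are hypothetical, not the benchmark's real schema.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        # Exact string match is an assumed scoring rule for this sketch.
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy, made-up records for illustration only (not benchmark data).
records = [
    {"subtask": "subtask_a", "complexity": 1, "prediction": "3", "answer": "3"},
    {"subtask": "subtask_a", "complexity": 2, "prediction": "4", "answer": "5"},
    {"subtask": "subtask_b", "complexity": 1, "prediction": "yes", "answer": "yes"},
    {"subtask": "subtask_b", "complexity": 2, "prediction": "no", "answer": "no"},
]

per_complexity = accuracy_by_group(records, "complexity")
per_subtask = accuracy_by_group(records, "subtask")
```

Grouping by `complexity` is what would expose a drop in accuracy as difficulty rises, while grouping by `subtask` would show which perceptual skills are weakest.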
Taxonomy
Topics: Color perception and design · Safety Warnings and Signage · Data Visualization and Analytics
