Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian, Tanuja Ganu

TL;DR
This paper introduces 'Mind's Eye', a benchmark for evaluating multimodal LLMs on visuo-cognitive tasks inspired by human intelligence tests, revealing current models' limitations in visuospatial reasoning.
Contribution
The paper presents a multiple-choice benchmark of eight visuo-cognitive tasks, organized under the 'A-R-T' taxonomy (Abstraction, Relation, and Transformation), to assess visuospatial reasoning in multimodal LLMs and compare it with human performance.
Findings
Humans achieve 80% accuracy on the benchmark.
Top MLLMs score below 50%, indicating limited visuospatial reasoning.
Error analysis traces model failures to visual attention allocation, internal perceptual manipulation, and weak abstraction.
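The accuracy figures above are standard multiple-choice accuracy. As a rough illustration only, the sketch below shows how overall and per-category accuracy could be computed for a benchmark grouped by the A-R-T categories. The record layout, the function names, and the toy answers are all hypothetical; they are not the authors' evaluation code or data.

```python
from collections import defaultdict

# Hypothetical toy records: (taxonomy category, predicted option, correct option).
# Illustrative placeholders only, not items or results from the benchmark.
records = [
    ("Abstraction", "A", "A"),
    ("Abstraction", "C", "B"),
    ("Relation", "D", "D"),
    ("Transformation", "B", "C"),
    ("Transformation", "A", "A"),
]


def accuracy_by_category(records):
    """Return {category: fraction of items answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, answer in records:
        total[category] += 1
        correct[category] += int(predicted == answer)
    return {category: correct[category] / total[category] for category in total}


def overall_accuracy(records):
    """Return the fraction of all items answered correctly (0.0 if empty)."""
    if not records:
        return 0.0
    return sum(predicted == answer for _, predicted, answer in records) / len(records)


per_category = accuracy_by_category(records)
overall = overall_accuracy(records)
print(per_category)
print(overall)
```

Reporting accuracy per category as well as overall makes it easier to see which of the three process types a model struggles with, which a single aggregate score would hide.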
Abstract
Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying…
