Pixels, Patterns, but No Poetry: To See The World like Humans

Hongcheng Gao; Zihao Huang; Lin Xu; Jingyi Tang; Xinhao Li; Yue Liu; Haoyang Li; Taihang Hu; Minhua Lin; Xinlong Yang; Ge Wu; Balong Bi; Hongyu Chen; Wentao Zhang

arXiv:2507.16863·cs.CV·July 24, 2025

Open Access · 1 dataset

TL;DR

This paper introduces the Turing Eye Test, a new benchmark for evaluating whether Multimodal Large Language Models perceive the world like humans, revealing current models' perceptual shortcomings despite reasoning strengths.

Contribution

The paper presents the Turing Eye Test, a perception-focused benchmark for MLLMs, highlighting their failures in human-like perception and the importance of vision tower fine-tuning.

Findings

01 State-of-the-art MLLMs fail on perceptual tasks trivial for humans.

02 In-context learning and language training do not improve perceptual performance.

03 Fine-tuning the vision tower enables rapid perceptual adaptation.
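
The contrast between findings 02 and 03 comes down to which parameter group receives gradient updates. The sketch below is a minimal illustration of that choice, not the paper's training code: the parameter names (`vision_tower.`, `projector.`, `language_model.`) and the helper `select_trainable` are hypothetical, and a real setup would apply the resulting mask in a training framework (for example by freezing every parameter marked False).

```python
from typing import Dict, Iterable, Tuple


def select_trainable(
    param_names: Iterable[str],
    trainable_prefixes: Tuple[str, ...],
) -> Dict[str, bool]:
    """Map each parameter name to True if it should be updated during fine-tuning.

    A parameter is trainable when its name starts with one of the given prefixes;
    everything else is treated as frozen.
    """
    return {name: name.startswith(trainable_prefixes) for name in param_names}


# Hypothetical parameter names for a generic MLLM (assumed, not taken from the paper).
PARAM_NAMES = [
    "vision_tower.blocks.0.attn.weight",
    "vision_tower.blocks.0.mlp.weight",
    "projector.linear.weight",
    "language_model.layers.0.attn.weight",
    "language_model.layers.0.mlp.weight",
]

# Setting A: update only the vision tower (the setting finding 03 refers to).
vision_only = select_trainable(PARAM_NAMES, ("vision_tower.",))

# Setting B: update only the language backbone (the setting finding 02 refers to).
language_only = select_trainable(PARAM_NAMES, ("language_model.",))

if __name__ == "__main__":
    for name in PARAM_NAMES:
        print(f"{name}: vision_only={vision_only[name]}, language_only={language_only[name]}")
```

Selecting by name prefix keeps the two settings symmetric, so the only thing that differs between runs is the module being adapted.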

Abstract

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training on the language backbone, which were effective for previous benchmarks, fail to improve…

Code & Models

Datasets

HongchengGao/TuringEyeTest
dataset · 241 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Neurobiology of Language and Bilingualism