Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Jen-Tse Huang; Dasen Dai; Jen-Yuan Huang; Youliang Yuan; Xiaoyuan Liu; Wenxuan Wang; Wenxiang Jiao; Pinjia He; Zhaopeng Tu; Haodong Duan

arXiv:2502.16435 · cs.CV · May 5, 2026

TL;DR

This paper introduces VisFactor, a new benchmark to evaluate foundational visual cognition in multimodal large language models, revealing significant gaps compared to human perception.

Contribution

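The paper presents VisFactor, a benchmark that digitizes 20 vision-centric subtests from FRCT across four domains of human visual cognition. It pairs these subtests with algorithms that automatically construct and validate test cases of controllable difficulty, and uses them to evaluate 39 frontier MLLMs.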

Findings

01. Best model scores only 54.0% on VisFactor.

02. Models fail on mental rotation, spatial inference, and figure-ground tasks.

03. Performance does not significantly improve with model size or prompting.

Abstract

Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from FRCT, a well-established cognitive psychology assessment spanning four domains of human visual cognition. Furthermore, we design algorithms to automatically construct and validate unlimited test cases with controllable difficulty. Using VisFactor, we evaluate 39 frontier MLLMs, including both proprietary (e.g., GPT, Gemini) and open-source (e.g., LLaMA, Qwen) models. The best model achieves a score of only 54.0%. Analysis reveals good internal consistency (Cronbach's alpha =…
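
The abstract reports internal consistency as Cronbach's alpha. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard textbook formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). It is not the authors' analysis code: the function name `cronbach_alpha` and the layout of the score matrix (one row per evaluated model, one column per subtest) are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Compute Cronbach's alpha for a score matrix.

    `scores` is a list of rows; each row holds one respondent's scores
    on k items. Treating models as respondents and subtests as items is
    an assumption of this sketch, not something stated in the paper.
    """
    if len(scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in scores):
        raise ValueError("all rows must have the same number of items")

    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

As a sanity check, a matrix whose columns are all identical (and not constant) gives an alpha of 1, since every item then ranks the respondents in exactly the same way.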
