KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?

Xianfeng Wang; Kaiwei Zhang; Qi Jia; Zijian Chen; Guangtao Zhai; Xiongkuo Min

arXiv:2601.08292·cs.CV·January 14, 2026

TL;DR

This paper introduces KidVis, a benchmark grounded in human visual development. It shows that current multimodal large language models fall well short of children on fundamental visual primitives, and that scaling model size alone does not yield linear improvements in these capabilities.

Contribution

The paper presents KidVis, a novel benchmark to evaluate basic visual primitives in MLLMs, highlighting their deficiencies relative to human children and uncovering the limitations of scaling model size.

Findings

01

Human children achieve a near-perfect average score of 95.32 on KidVis tasks.

02

The state-of-the-art GPT-5 scores only 67.33, which is 27.99 below the children's average of 95.32.

03

Scaling model size does not linearly improve visual primitives.
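
The size of the gap in findings 01 and 02 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `Score` class and `gap_report` helper are hypothetical names, not part of KidVis, and it assumes both averages are reported on the same scale. Only the two figures quoted above are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """An average benchmark score (hypothetical container, not from the paper)."""
    name: str
    average: float


def gap_report(baseline: Score, model: Score) -> tuple[float, float]:
    """Return (absolute gap, model score as a percentage of the baseline)."""
    absolute = round(baseline.average - model.average, 2)
    percent_of_baseline = round(100 * model.average / baseline.average, 2)
    return absolute, percent_of_baseline


# The two averages quoted in the findings above.
children = Score("children (6-7 years)", 95.32)
gpt5 = Score("GPT-5", 67.33)

absolute_gap, percent_of_baseline = gap_report(children, gpt5)
print(f"{gpt5.name} trails {children.name} by {absolute_gap} "
      f"({percent_of_baseline}% of the human baseline)")
```

Under these assumptions, GPT-5 reaches roughly 71% of the children's average.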

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks, such as complex diagrammatic interpretation, it remains an open question whether they possess the fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities - Concentration, Tracking, Discrimination, Memory, Spatial, and Closure - already possessed by 6-7 year old children, comprising 10 categories of low-semantic-dependent visual tasks. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. Results indicate that while human children achieve a near-perfect average score of 95.32, the state-of-the-art GPT-5 attains only 67.33.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Child and Animal Learning Development