BabyVision: Visual Reasoning Beyond Language

Liang Chen; Weichu Xie; Yiyan Liang; Hongfeng He; Hans Zhao; Zhibo Yang; Zhiqi Huang; Haoning Wu; Haoyu Lu; Y. charles; Yiping Bao; Yuantao Fan; Guopeng Li; Haiyang Shen; Xuanzhong Chen; Wendong Xu; Shuzheng Si; Zefan Cai; Wenhao Chai; Ziqi Huang; Fangfu Liu; Tianyu Liu; Baobao Chang; Xiaobo Hu; Kaiyuan Chen; Yixin Ren; Yang Liu; Yuan Gong; Kuan Li

arXiv:2601.06521·cs.CV·January 13, 2026

TL;DR

BabyVision is a benchmark that evaluates core visual abilities of multimodal LLMs (MLLMs) independently of linguistic knowledge. Leading models score well below human baselines, which points to a need for stronger visual perception in these models.

Contribution

The paper introduces BabyVision, a benchmark of 388 items in 22 subclasses across four categories for assessing fundamental visual skills in multimodal models, and reports model and human evaluations showing that current models lag behind humans on basic visual tasks.

Findings

01. Leading MLLMs score significantly below human baselines: Gemini3-Pro-Preview reaches 49.7, against an adult average of 94.1.

02. Current models lack fundamental visual primitives despite their language capabilities.

03. Progress on BabyVision is a step toward human-level visual perception.

Abstract

While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge for MLLMs. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show despite excelling in…

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Neurobiology of Language and Bilingualism