What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Xu Cao; Yifan Shen; Bolin Lai; Wenqian Ye; Yunsheng Ma; Joerg Heintz; Jintai Chen; Meihuan Huang; Jianguo Cao; Aidong Zhang; James M. Rehg

arXiv:2406.10424·cs.CV·September 16, 2025



TL;DR

This paper introduces MaRs-VQA, a dataset inspired by Raven's Progressive Matrices, to evaluate the visual cognition of multimodal large language models (MLLMs). Comparing model performance with human performance shows that current MLLMs still lag behind humans on matrix reasoning.

Contribution

The paper presents MaRs-VQA, a dataset for assessing visual reasoning in MLLMs, and fine-tunes a baseline model, Qwen2-VCog, to improve visual cognition in MLLMs.

Findings

01 MLLMs lag behind humans in matrix reasoning tasks.

02 The MaRs-VQA dataset enables standardized evaluation of visual cognition.

03 Fine-tuned Qwen2-VCog shows improved reasoning but still has limitations.

Abstract

Recently, Multimodal Large Language Models (MLLMs) and Vision Language Models (VLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level multi-image reasoning and visual working memory is not well-established. One such challenge is matrix reasoning - the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns. This skill is crucial during the early neurodevelopmental stages of children. Inspired by the matrix reasoning tasks in Raven's Progressive Matrices (RPM) and Wechsler Intelligence Scale for Children (WISC), we propose a new dataset MaRs-VQA to evaluate the visual cognition capability of MLLMs and compare their performance with existing human visual…
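As a rough illustration of how a multiple-choice matrix reasoning benchmark of this kind could be scored, the sketch below computes accuracy against an answer key and compares it with the random-guess baseline. The function names, the option count of four, and the toy predictions are all assumptions made for illustration. They are not taken from the paper or from the VCog-Bench repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of items whose predicted option index matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one item is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_level(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options


# Hypothetical data: four items, each with four candidate answers (assumed).
model_choices = [0, 2, 1, 3]  # option index picked by a model for each item
answer_key = [0, 2, 3, 3]     # correct option index for each item

score = accuracy(model_choices, answer_key)
baseline = chance_level(4)
print(f"accuracy={score:.2f}, chance={baseline:.2f}")
```

Reporting the chance level next to the raw score makes it easier to tell whether a model is reasoning about the pattern or effectively guessing.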


Code & Models

Repositories

IrohXu/VCog-Bench
PyTorch · Official

Datasets

IrohXu/VCog-Bench
dataset · 43 downloads


Taxonomy

Topics: Speech and dialogue systems

Methods: Sparse Evolutionary Training