What is the Visual Cognition Gap between Humans and Multimodal LLMs?
Xu Cao, Yifan Shen, Bolin Lai, Wenqian Ye, Yunsheng Ma, Joerg Heintz, Jintai Chen, Meihuan Huang, Jianguo Cao, Aidong Zhang, James M. Rehg

TL;DR
This paper introduces MaRs-VQA, a new dataset inspired by Raven's Matrices, to evaluate the visual cognition of multimodal large language models and compares their performance with human cognition, revealing current limitations.
Contribution
The paper presents MaRs-VQA, a dataset for assessing visual reasoning in MLLMs, and fine-tunes a baseline model, Qwen2-VCog, to improve the visual cognition capabilities of MLLMs.
Findings
MLLMs lag behind humans in matrix reasoning tasks.
The MaRs-VQA dataset enables standardized evaluation of visual cognition.
Fine-tuned Qwen2-VCog shows improved reasoning but still has limitations.
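The comparison behind these findings can be illustrated with a minimal sketch of how a multiple-choice matrix-reasoning benchmark might be scored against a human reference and a random-guess baseline. Everything here is an assumption for illustration: the `Item` schema, the four-option format, and the toy predictions are hypothetical and are not taken from the paper or from the MaRs-VQA release.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice matrix-reasoning question (hypothetical schema)."""
    item_id: str
    num_options: int  # number of candidate answers shown (assumed, not from the paper)
    answer: int       # index of the correct option


def accuracy(items, predictions):
    """Fraction of items whose predicted option index matches the answer.

    Items with no prediction count as incorrect.
    """
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for it in items if predictions.get(it.item_id) == it.answer)
    return correct / len(items)


def chance_accuracy(items):
    """Expected accuracy of uniform random guessing over each item's options."""
    if not items:
        raise ValueError("no items to score")
    return sum(1.0 / it.num_options for it in items) / len(items)


# Toy data, invented purely to exercise the functions above.
items = [Item("q1", 4, 2), Item("q2", 4, 0), Item("q3", 4, 3), Item("q4", 4, 1)]
model_preds = {"q1": 2, "q2": 1, "q3": 3, "q4": 0}
human_preds = {"q1": 2, "q2": 0, "q3": 3, "q4": 1}

model_acc = accuracy(items, model_preds)
human_acc = accuracy(items, human_preds)
chance = chance_accuracy(items)
gap = human_acc - model_acc  # the "human vs. model" gap on this toy set

print(f"model={model_acc:.2f} human={human_acc:.2f} chance={chance:.2f} gap={gap:.2f}")
```

Reporting the chance baseline next to the model score helps show whether a model is reasoning about the matrix at all or performing near random guessing.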
Abstract
Recently, Multimodal Large Language Models (MLLMs) and Vision Language Models (VLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level multi-image reasoning and visual working memory is not well-established. One such challenge is matrix reasoning - the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns. This skill is crucial during the early neurodevelopmental stages of children. Inspired by the matrix reasoning tasks in Raven's Progressive Matrices (RPM) and Wechsler Intelligence Scale for Children (WISC), we propose a new dataset MaRs-VQA to evaluate the visual cognition capability of MLLMs and compare their performance with existing human visual…
Taxonomy
Topics: Speech and dialogue systems
Methods: Sparse Evolutionary Training
