Visual Room 2.0: Seeing is Not Understanding for MLLMs
Haokun Li, Yazhou Zhang, Jizhi Ding, Qiuchi Li, Peng Zhang

TL;DR
This paper introduces Visual Room 2.0, a hierarchical benchmark to evaluate perception and cognition in multi-modal large language models, revealing their perceptual strengths but limited cognitive understanding and the non-causal link between perception and cognition.
Contribution
The paper proposes a new hierarchical benchmark for perception-cognition alignment in MLLMs and provides empirical analysis across 17 tasks, highlighting the gap between perception and understanding.
Findings
MLLMs are better at perception than cognition.
Cognition does not depend causally on perception-based reasoning.
Cognition improves with model size, whereas perception does not necessarily do so.
Abstract
Can multi-modal large language models (MLLMs) truly understand what they can see? Extending Searle's Chinese Room into the multi-modal domain, this paper proposes the Visual Room argument: MLLMs may describe every visual detail precisely yet fail to comprehend the underlying emotions and intentions, namely seeing is not understanding. Building on this, we introduce Visual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The perception component ranges from attribute recognition to scene understanding, while the cognition component extends from textual entailment to causal and social reasoning. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception…
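The dataset arithmetic in the abstract can be checked with a minimal sketch. The abstract states only that each of the 350 samples carries six progressive questions; the reading that these six correspond to two components (perception, cognition) at three levels (low, middle, high) is an assumption made here for illustration, as are the names `Question` and `build_question_index`, which do not come from the paper.

```python
from dataclasses import dataclass
from itertools import product

# Levels are named in the abstract; pairing them one-to-one with the two
# components to obtain six questions per sample is an ASSUMPTION.
LEVELS = ("low", "middle", "high")
COMPONENTS = ("perception", "cognition")


@dataclass(frozen=True)
class Question:
    """Hypothetical record identifying one benchmark question."""
    sample_id: int
    component: str
    level: str


def build_question_index(num_samples: int = 350) -> list[Question]:
    """Enumerate one question per (sample, component, level) triple."""
    return [
        Question(sample_id, component, level)
        for sample_id in range(num_samples)
        for component, level in product(COMPONENTS, LEVELS)
    ]


questions = build_question_index()
print(len(questions))  # 350 samples * 6 questions each
```

Under this assumed layout the index holds 350 × 6 = 2,100 entries, which matches the total reported in the abstract.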
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Multisensory perception and integration
