Are MLMs Trapped in the Visual Room?
Yazhou Zhang, Chunwang Zou, Qimeng Liu, Lu Rong, Ben Yao, Zheng Lian, Qiuchi Li, Peng Zhang, Jing Qin

TL;DR
This paper asks whether multi-modal large models (MLMs) truly understand visual content. It introduces a two-tier evaluation framework and a sarcasm dataset, and reports a gap between perception accuracy and understanding ability.
Contribution
It proposes a novel perception-cognition evaluation framework and a high-quality sarcasm dataset to empirically assess MLMs' understanding beyond surface perception.
Findings
MLMs show high accuracy in visual perception
MLMs show an error rate of about 17.1% in sarcasm understanding, even when perception is correct
Weaknesses identified in context integration, emotional reasoning, and pragmatic inference
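The gap between perception and understanding in the findings above can be read as a conditional error rate: among samples the model perceives correctly, how often does it still get the sarcasm polarity wrong? The sketch below shows one way to compute such a rate. It is an illustration under assumptions, not the paper's evaluation code: the `Sample` record, its two boolean fields, and the function name are hypothetical, and the toy data are invented rather than drawn from the paper's results.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One evaluated item; both fields are hypothetical labels for illustration."""
    perception_correct: bool  # model captured the surface-level content correctly
    cognition_correct: bool   # model inferred the sarcasm polarity correctly


def conditional_error_rate(samples):
    """Share of correctly perceived samples whose sarcasm polarity is wrong."""
    perceived = [s for s in samples if s.perception_correct]
    if not perceived:
        return 0.0  # no correctly perceived samples, so no conditional errors
    misunderstood = sum(1 for s in perceived if not s.cognition_correct)
    return misunderstood / len(perceived)


# Toy data for illustration only, not results from the paper.
toy = [
    Sample(True, True),
    Sample(True, False),
    Sample(True, True),
    Sample(False, False),
    Sample(True, True),
]
rate = conditional_error_rate(toy)
```

Samples where perception itself fails are excluded from the denominator, so the rate isolates failures of understanding from failures of seeing.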
Abstract
Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. In implementation, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual content, whereas the cognition component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All…
