Are MLMs Trapped in the Visual Room?

Yazhou Zhang; Chunwang Zou; Qimeng Liu; Lu Rong; Ben Yao; Zheng Lian; Qiuchi Li; Peng Zhang; Jing Qin

arXiv:2505.23272 · cs.CV · June 2, 2025

TL;DR

This paper questions whether multi-modal large models truly understand visual content by introducing a two-tier evaluation framework and a sarcasm dataset, revealing a gap between perception accuracy and understanding ability.

Contribution

It proposes a novel perception-cognition evaluation framework and a high-quality sarcasm dataset to empirically assess MLMs' understanding beyond surface perception.

Findings

01 MLMs achieve high accuracy on visual perception.

02 Even when perception is correct, MLMs show an error rate of about 17.1% on sarcasm understanding.

03 The weaknesses identified lie in context integration, emotional reasoning, and pragmatic inference.
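
One way to read the gap in finding 02 is as a conditional error rate: among samples the model perceives correctly, how often does it still misjudge sarcasm? The sketch below illustrates that reading on made-up toy records. The names `Sample`, `perception_accuracy`, and `conditional_error_rate` are hypothetical, and whether the paper computes its 17.1% figure in exactly this way is an assumption, not something this page confirms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One evaluated item (hypothetical record layout, not the paper's)."""
    perception_correct: bool  # tier 1: surface-level details described correctly
    cognition_correct: bool   # tier 2: sarcasm polarity inferred correctly


def perception_accuracy(samples: List[Sample]) -> Optional[float]:
    """Fraction of samples whose perception tier is correct."""
    if not samples:
        return None
    return sum(s.perception_correct for s in samples) / len(samples)


def conditional_error_rate(samples: List[Sample]) -> Optional[float]:
    """Cognition error rate restricted to correctly perceived samples."""
    perceived = [s for s in samples if s.perception_correct]
    if not perceived:
        return None  # undefined when nothing was perceived correctly
    errors = sum(1 for s in perceived if not s.cognition_correct)
    return errors / len(perceived)


# Toy data for illustration only; these are not results from the paper.
toy = [
    Sample(perception_correct=True, cognition_correct=True),
    Sample(perception_correct=True, cognition_correct=True),
    Sample(perception_correct=True, cognition_correct=True),
    Sample(perception_correct=True, cognition_correct=False),
    Sample(perception_correct=False, cognition_correct=False),
]

print(perception_accuracy(toy))
print(conditional_error_rate(toy))
```

Conditioning on correct perception keeps the two tiers separate: a cognition failure counted here cannot be blamed on the model having misread the image.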

Abstract

Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. To put the argument into practice, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual content, whereas the cognition component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All…
