Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Yongheng Zhang, Junnan Dong, Shu-Yu Guo, Yangning Li, Qingyu Zhou, Wenhao Jiang, Hai-Tao Zheng, Ying Shen, Liang Lin, Philip S. Yu

TL;DR
This paper introduces a comprehensive benchmark revealing that current multimodal large language models struggle with discrete symbol recognition, often relying on linguistic priors rather than true visual understanding.
Contribution
The authors present a multi-domain benchmark highlighting a persistent cognitive mismatch in MLLMs, emphasizing the need for grounded perception in discrete semantic spaces.
Findings
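Models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks.
Models often compensate with linguistic priors, template retrieval, or procedural reasoning instead of robust visual grounding.
This recognition-reasoning inversion is especially clear for sparse, low-redundancy symbols such as handwritten characters and formula graphs.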
Abstract
Multimodal large language models (MLLMs) perform strongly on natural images, yet their ability to understand discrete visual symbols remains unclear. We present a multi-domain benchmark spanning language, culture, mathematics, physics and chemistry, organized into three cognitive levels: perception and recognition, combination and reasoning, and association and critical thinking. Across leading MLLMs, we observe a consistent cognitive mismatch. Models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks. This recognition-reasoning inversion indicates that current systems often compensate with linguistic priors, template retrieval or procedural reasoning instead of robust visual grounding. The pattern is especially clear for sparse, low-redundancy symbols such as handwritten characters, formula graphs, circuit…
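One way to make the recognition-reasoning inversion concrete is to compare accuracy across the three cognitive levels. The sketch below is illustrative only: the level names come from the abstract, while the record format, the function names (accuracy_by_level, shows_inversion) and the toy outcomes are assumptions and do not reflect the authors' evaluation code or reported results.

```python
from collections import defaultdict

# Level names follow the abstract; everything else here is hypothetical.
LEVELS = (
    "perception and recognition",
    "combination and reasoning",
    "association and critical thinking",
)


def accuracy_by_level(records):
    """Return {level: fraction correct} for (level, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / totals[level] for level in totals}


def shows_inversion(acc):
    """True when recognition accuracy falls below reasoning accuracy."""
    return acc[LEVELS[0]] < acc[LEVELS[1]]


# Made-up outcomes for illustration only, not results from the paper.
toy_records = [
    (LEVELS[0], False), (LEVELS[0], True), (LEVELS[0], False), (LEVELS[0], False),
    (LEVELS[1], True), (LEVELS[1], True), (LEVELS[1], False), (LEVELS[1], True),
    (LEVELS[2], True), (LEVELS[2], False),
]
toy_acc = accuracy_by_level(toy_records)
print(toy_acc, shows_inversion(toy_acc))
```

Under this assumed definition, a model shows the inversion whenever it scores lower on the first level than on the second; a real analysis would likely also break results down by domain and symbol type.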
