Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Yinghui Li; Jiayi Kuang; Peng Xing; Daixian Liu; Yongheng Zhang; Junnan Dong; Shu-Yu Guo; Yangning Li; Qingyu Zhou; Wenhao Jiang; Hai-Tao Zheng; Ying Shen; Liang Lin; Philip S. Yu

arXiv:2603.18472 · cs.AI · April 10, 2026

TL;DR

This paper introduces a multi-domain benchmark showing that current multimodal large language models struggle with discrete symbol recognition and often rely on linguistic priors rather than visual grounding.

Contribution

The authors present a multi-domain benchmark that exposes a persistent cognitive mismatch in multimodal large language models (MLLMs) and underscores the need for grounded perception in discrete semantic spaces.

Findings

01. Models underperform on elementary symbol recognition tasks.

02. Models rely on linguistic priors and procedural reasoning instead of visual grounding.

03. Recognition-reasoning inversion is especially evident with sparse, low-redundancy symbols.

Abstract

Multimodal large language models (MLLMs) perform strongly on natural images, yet their ability to understand discrete visual symbols remains unclear. We present a multi-domain benchmark spanning language, culture, mathematics, physics and chemistry, organized into three cognitive levels: perception and recognition, combination and reasoning, and association and critical thinking. Across leading MLLMs, we observe a consistent cognitive mismatch. Models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks. This recognition-reasoning inversion indicates that current systems often compensate with linguistic priors, template retrieval or procedural reasoning instead of robust visual grounding. The pattern is especially clear for sparse, low-redundancy symbols such as handwritten characters, formula graphs, circuit…
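
The recognition-reasoning inversion described above can be checked with a per-level accuracy comparison. The following is a minimal sketch, not the authors' evaluation code: the level names come from the abstract, while the record layout, the function names `accuracy_by_level` and `shows_inversion`, and the toy data are all hypothetical and invented for illustration.

```python
from collections import defaultdict

# Level names follow the abstract; everything else here is an assumption.
LEVELS = (
    "perception and recognition",
    "combination and reasoning",
    "association and critical thinking",
)


def accuracy_by_level(records):
    """Return {level: fraction correct} for (level, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in totals}


def shows_inversion(acc):
    """True when recognition accuracy is below reasoning accuracy."""
    return acc[LEVELS[0]] < acc[LEVELS[1]]


# Toy records invented for illustration; these are NOT results from the paper.
toy_records = [
    (LEVELS[0], False), (LEVELS[0], True), (LEVELS[0], False), (LEVELS[0], False),
    (LEVELS[1], True), (LEVELS[1], True), (LEVELS[1], False), (LEVELS[1], True),
    (LEVELS[2], True), (LEVELS[2], False),
]

toy_acc = accuracy_by_level(toy_records)
inverted = shows_inversion(toy_acc)
```

Comparing only the first two levels is a simplification chosen for this sketch; a fuller analysis would presumably also break results down by domain (language, culture, mathematics, physics, chemistry).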

Code & Models

Datasets

Eternity-gaga/SymbolBench
dataset · 66 downloads
