MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang; Chenghao Yang; Zhoufutu Wen; Sihang Yuan; Qiuyue Wang; Chaoyi Huang; Guosheng Zhu; He Wang; Huawenyu Lu; Jianing Wen; Jianpeng Jiao; Lishu Luo; Longxiang Liu; Sijin Wu; Xiaolei Zhu; Xuanliang Zhang; Yu Liu; Ge Zhang; Yi Lin; Guang Shi; Chaoyou Fu; Wenhao Huang

arXiv:2511.03146·cs.CL·December 30, 2025

TL;DR

This paper introduces MME-CC, a comprehensive benchmark for evaluating vision-centric cognitive reasoning in multimodal models, revealing current limitations and error patterns across 16 models.

Contribution

The paper presents MME-CC, a new benchmark organizing reasoning tasks into spatial, geometric, and knowledge-based categories to systematically assess MLLMs' cognitive capacities.

Findings

01 Closed-source models outperform open-source ones.

02 Spatial and geometric reasoning are notably weak.

03 Common errors include orientation mistakes and fragile cross-view identity.

Abstract

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This benchmark is a valuable contribution to the burgeoning domain of multimodal benchmarks evaluating spatial reasoning.
2. The problem annotation and review process is sound and clear.
3. The analysis of error patterns is convincing and a useful avenue of exploration.

Weaknesses

1. The work makes no mention of other benchmarks in the domain, nor does it compare against similar benchmarks or against benchmarks used for a subset of the tasks compiled in this work (for example, EmbSpatial, Space3D-Bench for spatial reasoning, PolyMATH for geometric reasoning, and the many VQA datasets). This work could benefit from a better demonstration of how the problems in this benchmark are comparable or superior to existing benchmarks collated, perhaps as a more comprehensive extension to Table 3. 2

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1) Eleven subtasks (e.g., Satellite-Map matching, Indoor dedup-counting, Maze, Unblock Me, Counterfactual) cover some breadth.
2) Human-in-the-loop construction, expert-only validation for tricky subtasks, standardized post-processing, and model-based filtering to remove trivial/ambiguous items.

Weaknesses

1. While the use of an LLM-as-a-judge (DeepSeek-V3-0324) offers scalability and speed in evaluation, relying on a single model for correctness judgment introduces the risk of systematic bias. The authors mention a 95% agreement with human annotators across 99 samples, but this sample size is relatively small given the dataset’s size (1,173 items). There is no report of inter-rater reliability metrics (e.g., Cohen’s Kappa) nor a cross-model evaluation with alternative judges (e.g., GPT-4, Claude,
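The sample-size concern in this review can be made concrete. The following is a minimal Python sketch, not the authors' evaluation code: `wilson_interval` and `cohens_kappa` are hypothetical helper names, the count of 94 agreements out of 99 is an assumption obtained by rounding the reported 95% figure, and the two short label lists are toy data used only to show the kappa calculation.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_o - p_e) / (1 - p_e)


# ASSUMPTION: 95% agreement on 99 samples is taken to mean 94 agreements.
low, high = wilson_interval(94, 99)
print(f"approximate 95% interval for judge-human agreement: {low:.3f} to {high:.3f}")

# Toy labels (1 = judged correct, 0 = judged incorrect); not data from the paper.
human = [1, 1, 0, 0, 1, 0]
judge = [1, 1, 0, 1, 1, 0]
print(f"toy Cohen's kappa: {cohens_kappa(human, judge):.3f}")
```

Under these assumptions the interval spans several percentage points, which illustrates why the reviewer asks for a larger validation sample and a chance-corrected statistic rather than raw agreement alone.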

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper is clearly organized and easy to follow. The benchmark taxonomy and pipeline descriptions are detailed and transparent.
2. MME-CC includes a diverse set of vision-centric reasoning tasks, systematically categorized by cognitive dimensions (spatial, geometric, visual-knowledge).
3. The paper evaluates 16 SOTA MLLMs, with thoughtful analyses of reasoning performance, scaling trends and error types.

Weaknesses

1. While MME-CC is well-engineered, it largely repackages existing vision reasoning types under a new taxonomy. Prior works also emphasize visual reasoning or multimodal cognition. Compared with benchmarks like ZeroBench or MMStar, the main advance seems to be categorization and dataset curation rather than a fundamentally new evaluation concept.
2. The paper claims MME-CC focuses on “vision-based cognitive capacity” and is “language-independent,” but many tasks still depend on textual prompts f

Reviewer 04 · Rating 4 · Confidence 5

Strengths

+ Evaluated 16 MLLMs, covering major frontier models.
+ All collected test cases undergo human verification (at least 2 annotators).

Weaknesses

- The benchmark contains 11 tasks, categorized into Spatial Information, Geometric Information, and Visual Knowledge Reasoning. The major weakness in this paper is that the authors do not address the “why” issue. Why do you pick these three major directions? Are they essential in human cognition (are they prerequisites of other downstream tasks)? Or are they key performance indicators of some human abilities? Why do you pick the 11 tasks for the three directions? Are they sufficiently representat

Code & Models

Datasets

MaxwellWen/MME-CC
dataset· 3.5k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Action Observation and Synchronization