MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv, Ruiqi Chen, Jing Yang, Yijia Fan, Xiaofei Sun, Jian Wang, Ziliang Chen, Liang Lin, and Keze Wang

arXiv:2512.08228 · cs.CV · December 10, 2025

TL;DR

MM-CoT is a new benchmark designed to evaluate whether multimodal models genuinely ground their reasoning in visual evidence and maintain logical coherence, addressing limitations of existing generation-focused benchmarks.

Contribution

The paper introduces MM-CoT, a diagnostic benchmark that assesses visual grounding and logical coherence in multimodal models' chain-of-thought reasoning, filling a critical evaluation gap.

Findings

1. Leading models struggle with both visual grounding and logical coherence.

2. MM-CoT reveals a gap between generative fluency and reasoning fidelity.

3. MM-CoT scores correlate poorly with existing benchmark metrics, indicating that it measures distinct aspects of reasoning.

Abstract

The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity.…
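The selection task described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `Candidate` record, its two boolean labels, and the `evaluate` function are hypothetical names introduced here, and the sketch assumes each item's candidate chains carry ground-truth labels for the two constraints, with exactly one chain satisfying both.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    """One candidate event chain with hypothetical ground-truth labels."""
    chain: List[str]            # ordered reasoning steps
    visually_consistent: bool   # every step anchored in observable evidence
    logically_coherent: bool    # causally and commonsensically valid


def correct_index(candidates: List[Candidate]) -> int:
    """Return the index of the sole chain that satisfies both constraints."""
    valid = [
        i for i, c in enumerate(candidates)
        if c.visually_consistent and c.logically_coherent
    ]
    if len(valid) != 1:
        raise ValueError("expected exactly one chain satisfying both constraints")
    return valid[0]


def evaluate(
    items: List[List[Candidate]],
    select: Callable[[List[List[str]]], int],
) -> Dict[str, int]:
    """Count correct picks and classify wrong picks by the violated constraint.

    `select` stands in for the model: it sees only the chains (no labels)
    and returns the index of the chain it chooses.
    """
    counts = {"correct": 0, "visual_error": 0, "logical_error": 0, "both_errors": 0}
    for candidates in items:
        correct_index(candidates)  # validates the single-answer assumption
        picked = candidates[select([c.chain for c in candidates])]
        if picked.visually_consistent and picked.logically_coherent:
            counts["correct"] += 1
        elif picked.logically_coherent:
            counts["visual_error"] += 1    # coherent but not grounded
        elif picked.visually_consistent:
            counts["logical_error"] += 1   # grounded but not coherent
        else:
            counts["both_errors"] += 1
    return counts
```

Splitting wrong picks by the violated constraint is a design choice of this sketch; it mirrors the abstract's description of the two constraints as orthogonal, so a grounding failure and a coherence failure can be reported separately.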

Taxonomy

Topics: Multimodal Machine Learning Applications · Embodied and Extended Cognition · Constraint Satisfaction and Optimization