M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, Shuicheng Yan

TL;DR
M3CoTBench is a comprehensive benchmark designed to evaluate the correctness, efficiency, impact, and consistency of chain-of-thought reasoning in multimodal large language models for medical image understanding, addressing a critical gap in current evaluation methods.
Contribution
This paper introduces M3CoTBench, a new benchmark with diverse datasets, tasks, and metrics specifically for assessing CoT reasoning in medical imaging AI systems.
Findings
Current MLLMs show limitations in reliable reasoning
Benchmark reveals gaps in interpretability and clinical trustworthiness
Provides insights for improving AI diagnostic models
Abstract
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such opaque reasoning processes lack reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper tackles an emerging yet underexplored topic: evaluating Chain-of-Thought reasoning in medical multimodal LLMs, which is both timely and relevant to advancing trustworthy medical AI.
2. The benchmark is validated on a broad range of both open- and closed-source MLLMs, providing a well-rounded comparison that highlights current model limitations and practical challenges in clinical reasoning.
1. The definition of CoT in the medical area is unclear. Although the paper claims that its Chain-of-Thought (CoT) formulation “mirrors clinicians’ cognitive workflow”, the reasoning template shown in the Appendix appears overly simplified. It typically only has four steps: examination type -> key features -> key conclusion -> additional analysis. It is unclear why this sequence represents a gold standard reasoning path in clinical diagnosis. Is it based on any references, such as guidelines in
1. This paper introduces M3CoTBench, encompassing 24 imaging modalities to evaluate MLLMs' understanding capabilities across diverse medical imaging contexts.
2. The benchmark introduces tailored metrics to assess reasoning quality across four dimensions: correctness of each reasoning step, efficiency cost, impact on final answer accuracy, and logical consistency, providing a more nuanced evaluation beyond traditional accuracy measures.
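To make the "impact on final answer accuracy" dimension concrete, here is a minimal sketch of one plausible way to quantify it: the change in accuracy when the same questions are answered with a CoT prompt versus directly. The function name `cot_impact` and the toy data are hypothetical; the paper's exact metric definition is not given in this summary and may differ.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not correct:
        raise ValueError("need at least one graded answer")
    return sum(correct) / len(correct)


def cot_impact(correct_with_cot: Sequence[bool],
               correct_direct: Sequence[bool]) -> float:
    """Accuracy with CoT prompting minus accuracy with direct answering.

    Hypothetical definition for illustration only. Positive values mean
    CoT helped on this question set; negative values mean it hurt.
    """
    if len(correct_with_cot) != len(correct_direct):
        raise ValueError("both runs must cover the same questions")
    return accuracy(correct_with_cot) - accuracy(correct_direct)


# Toy grading results for four questions (made-up values, not benchmark data).
with_cot = [True, True, False, True]
direct = [True, False, False, True]
impact = cot_impact(with_cot, direct)
print(f"CoT impact: {impact:+.2f}")
```

Pairing the two runs on the same question set keeps the difference attributable to the prompting style rather than to a change in question difficulty.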
1. M3CoTBench spans 24 modalities and 13 task types, but contains only 1,079 image-based QA pairs. Given this broad coverage, does each category have sufficient samples? The paper does not appear to provide per-category statistics.
2. The benchmark’s dataset, while diverse, is relatively small (only 1,079 Q&A pairs) compared to other medical VQA datasets, which may limit the statistical breadth of evaluation.
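The coverage concern can be checked with back-of-the-envelope arithmetic. The sketch below uses only the counts quoted in the reviews (1,079 QA pairs, 24 modalities, 13 task types) and assumes, purely for illustration, that samples are spread evenly and that every modality and task combination occurs; the actual per-category distribution is not reported here.

```python
# Counts quoted in the reviews; the even-split assumption is illustrative only.
TOTAL_QA_PAIRS = 1079
NUM_MODALITIES = 24
NUM_TASK_TYPES = 13


def mean_per_bucket(total: int, buckets: int) -> float:
    """Average number of samples per bucket under a uniform split."""
    return total / buckets


per_modality = mean_per_bucket(TOTAL_QA_PAIRS, NUM_MODALITIES)
per_task = mean_per_bucket(TOTAL_QA_PAIRS, NUM_TASK_TYPES)
# Upper bound on granularity: every modality crossed with every task type.
per_cell = mean_per_bucket(TOTAL_QA_PAIRS, NUM_MODALITIES * NUM_TASK_TYPES)

print(f"per modality:        {per_modality:.1f}")
print(f"per task type:       {per_task:.1f}")
print(f"per modality x task: {per_cell:.1f}")
```

Under these assumptions each modality averages about 45 QA pairs and each task type exactly 83, while a full modality-by-task grid would average fewer than four per cell, which is why per-category statistics matter for interpreting fine-grained results.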
- Addresses Critical Gap: First comprehensive benchmark for CoT reasoning in medical imaging, important for clinical AI transparency and trust.
- High-Quality Curation:
  - Diverse coverage: 24 modalities from 55 public datasets
  - Rigorous annotation: Multi-stage validation with medical experts
  - Clinical alignment: 4-step reasoning framework mirrors diagnostic workflows
- Novel Evaluation Framework: Four dimensions (correctness, efficiency, impact, consistency) provide comprehensive CoT as
Methodological Concerns:
- The dataset comprises only 1,079 images, relatively small compared to other medical reasoning benchmarks (e.g., OmniMedVQA with 118K+ images).
- Potential Bias: Although reasoning steps undergo expert validation and revision, their initial generation by GPT-4o may introduce biases inherent to its reasoning style, which might persist despite subsequent human refinement.
- Evaluation Circularity: The study uses GPT-4o both to generate reasoning chains and to evaluate th
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
