M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Juntao Jiang; Jiangning Zhang; Yali Bi; Jinsheng Bai; Weixuan Liu; Weiwei Jin; Zhucun Xue; Yong Liu; Xiaobin Hu; Shuicheng Yan

arXiv:2601.08758·eess.IV·March 24, 2026

Open Access · 3 Reviews

TL;DR

M3CoTBench is a comprehensive benchmark designed to evaluate the correctness, efficiency, impact, and consistency of chain-of-thought reasoning in multimodal large language models for medical image understanding, addressing a critical gap in current evaluation methods.

Contribution

This paper introduces M3CoTBench, a new benchmark with diverse datasets, tasks, and metrics specifically for assessing CoT reasoning in medical imaging AI systems.

Findings

1. Current MLLMs show limitations in reliable reasoning.
2. The benchmark reveals gaps in interpretability and clinical trustworthiness.
3. The benchmark provides insights for improving AI diagnostic models.

Abstract

Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such opaque reasoning processes lack reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. The paper tackles an emerging yet underexplored topic: evaluating Chain-of-Thought reasoning in medical multimodal LLMs, which is both timely and relevant to advancing trustworthy medical AI.
2. The benchmark is validated on a broad range of both open- and closed-source MLLMs, providing a well-rounded comparison that highlights current model limitations and practical challenges in clinical reasoning.

Weaknesses

1. The definition of CoT in the medical area is unclear. Although the paper claims that its Chain-of-Thought (CoT) formulation “mirrors clinicians’ cognitive workflow”, the reasoning template shown in the Appendix appears overly simplified. It typically only has four steps: examination type -> key features -> key conclusion -> additional analysis. It is unclear why this sequence represents a gold standard reasoning path in clinical diagnosis. Is it based on any references, such as guidelines in

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This paper introduces M3CoTBench, encompassing 24 imaging modalities to evaluate MLLMs' understanding capabilities across diverse medical imaging contexts.
2. The benchmark introduces tailored metrics to assess reasoning quality across four dimensions: correctness of each reasoning step, efficiency cost, impact on final answer accuracy, and logical consistency, providing a more nuanced evaluation beyond traditional accuracy measures.
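
As a rough illustration of the "impact on final answer accuracy" dimension mentioned above, the sketch below compares accuracy with and without CoT prompting. This is only an assumed formulation (accuracy with CoT minus accuracy without it); the paper's actual metric definition is not given in this excerpt, and the function names and toy data are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one answer is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def cot_impact(
    direct_preds: Sequence[str],
    cot_preds: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Assumed impact score: accuracy with CoT minus accuracy without CoT.

    A positive value means CoT prompting helped; a negative value means it hurt.
    This is an illustrative definition, not the benchmark's own.
    """
    return accuracy(cot_preds, answers) - accuracy(direct_preds, answers)


# Toy multiple-choice data (hypothetical, not taken from the benchmark).
answers = ["A", "B", "C", "D"]
direct = ["A", "B", "D", "D"]  # 3 of 4 correct without CoT
cot = ["A", "C", "D", "D"]     # 2 of 4 correct with CoT
impact = cot_impact(direct, cot, answers)
print(f"impact = {impact:+.2f}")
```

In this toy case the score is negative, which is the kind of outcome such a dimension is meant to surface: step-by-step reasoning does not necessarily improve the final answer.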

Weaknesses

1. M3CoTBench spans 24 modalities and 13 task types, but contains only 1,079 image-based QA pairs. Given this broad coverage, does each category have sufficient samples? The paper does not appear to provide per-category statistics.
2. The benchmark’s dataset, while diverse, is relatively small (only 1,079 Q&A pairs) compared to other medical VQA datasets, which may limit the statistical breadth of evaluation.
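
To make the sample-size concern concrete, the following back-of-the-envelope sketch computes the average number of QA pairs per category from the figures quoted in the review. It assumes a perfectly uniform split, which the benchmark is unlikely to have, and treats every modality-task combination as a possible cell, which is also only an assumption.

```python
# Figures quoted in the review above.
TOTAL_QA_PAIRS = 1079
NUM_MODALITIES = 24
NUM_TASK_TYPES = 13


def mean_per_category(total: int, categories: int) -> float:
    """Average sample count per category under an assumed uniform split."""
    if categories <= 0:
        raise ValueError("categories must be positive")
    return total / categories


per_modality = mean_per_category(TOTAL_QA_PAIRS, NUM_MODALITIES)
per_task = mean_per_category(TOTAL_QA_PAIRS, NUM_TASK_TYPES)
# Assumption: all 24 x 13 modality-task cells are populated.
per_cell = mean_per_category(TOTAL_QA_PAIRS, NUM_MODALITIES * NUM_TASK_TYPES)

print(f"per modality:           {per_modality:.1f}")
print(f"per task type:          {per_task:.1f}")
print(f"per modality-task cell: {per_cell:.1f}")
```

Under these assumptions the averages come to roughly 45 pairs per modality, 83 per task type, and between 3 and 4 per modality-task cell, so any real imbalance would leave some categories with very few samples.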

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Addresses Critical Gap: First comprehensive benchmark for CoT reasoning in medical imaging - important for clinical AI transparency and trust.
- High-Quality Curation:
  - Diverse coverage: 24 modalities from 55 public datasets
  - Rigorous annotation: Multi-stage validation with medical experts
  - Clinical alignment: 4-step reasoning framework mirrors diagnostic workflows
- Novel Evaluation Framework: Four dimensions (correctness, efficiency, impact, consistency) provide comprehensive CoT as

Weaknesses

Methodological Concerns:
- The dataset comprises only 1,079 images, relatively small compared to other medical reasoning benchmarks (e.g., OmniMedVQA with 118K+ images).
- Potential Bias: Although reasoning steps undergo expert validation and revision, their initial generation by GPT-4o may introduce biases inherent to its reasoning style, which might persist despite subsequent human refinement.
- Evaluation Circularity: The study uses GPT-4o both to generate reasoning chains and to evaluate th

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare