S-Chain: Structured Visual Chain-of-Thought For Medicine
Khai Le-Duc, Duy M. H. Nguyen, Phuong T. H. Trinh, Tien-Phat Nguyen, Nghiem T. Diep, An Ngo, Tung Vu, Trinh Vuong, Anh-Tien Nguyen, Mau Nguyen, Van Trung Hoang, Khai-Nguyen Nguyen, Hy Nguyen, Chris Ngo, Anji Liu, Nhat Ho, Anne-Christin Hauschild, Khanh Xuan Nguyen

TL;DR
S-Chain introduces a large-scale, expert-annotated dataset with structured visual reasoning for medical VLMs, significantly enhancing interpretability, grounding, and robustness in medical visual question answering.
Contribution
S-Chain provides the first large-scale dataset with structured visual CoT annotations for medical images, enabling improved reasoning and grounding in medical vision-language models.
Findings
SV-CoT supervision improves interpretability and grounding fidelity.
Benchmarks on medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5) show that these models benefit from SV-CoT supervision.
Proposed mechanisms enhance alignment between visual evidence and reasoning.
Abstract
Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability,…
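To make the idea of linking visual regions to reasoning steps concrete, here is a minimal sketch of how one SV-CoT record might be represented. The four stage names follow the localization → description → grading → diagnosis sequence mentioned in the reviews; every class name, field name, and example value below is hypothetical and is not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Stage order as described in the reviews; the identifiers are assumptions.
STAGES = ("localization", "description", "grading", "diagnosis")

# Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ReasoningStep:
    stage: str
    text: str
    boxes: List[Box] = field(default_factory=list)


@dataclass
class SVCoTRecord:
    image_id: str
    language: str
    question: str
    steps: List[ReasoningStep]
    answer: str

    def is_well_formed(self) -> bool:
        """Check stage order, that localization is grounded, and box geometry."""
        if tuple(step.stage for step in self.steps) != STAGES:
            return False
        if not self.steps[0].boxes:  # localization must point at a region
            return False
        return all(
            x0 < x1 and y0 < y1
            for step in self.steps
            for (x0, y0, x1, y1) in step.boxes
        )


# Made-up example values, for illustration only.
example = SVCoTRecord(
    image_id="scan_0001",
    language="en",
    question="What is the dementia severity shown in this MRI slice?",
    steps=[
        ReasoningStep("localization", "Region of interest identified.", [(40, 60, 90, 110)]),
        ReasoningStep("description", "Description of the region's appearance."),
        ReasoningStep("grading", "Grade assigned on a clinical scale."),
        ReasoningStep("diagnosis", "Severity inferred from the grade."),
    ],
    answer="Mild",
)

print(example.is_well_formed())
```

A validity check of this kind is one simple way to ensure that every textual rationale in a record is tied to at least one image region before it is used for supervision.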
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. A large-scale dataset
2. Expert-involved annotation pipeline

1. The dataset mainly consists of Alzheimer’s disease (AD) MRI figures, introducing a significant bias that limits the generalization and undermines the broad claim of “Chain-of-Thought for Medicine.”
2. The proposed Chain-of-Thought (CoT) approach appears too rigid, resembling a predetermined analytical workflow rather than flexible, natural reasoning.
3. The results show Gemini 2.5 Flash performing much better than all other models, which seems unusual and raises concerns about the evaluation

1. S-Chain introduces the first expert-annotated Structured Visual Chain-of-Thought (SV-CoT) benchmark, covering 12k medical images across 16 languages, effectively filling the gap in evaluating visual–reasoning consistency within the medical domain.
2. The four-stage structured reasoning process (localization → description → grading → diagnosis) mirrors real clinical diagnostic logic, enabling models to generate traceable and interpretable reasoning paths while mitigating hallucinations and sem

1. The dataset is primarily based on the OASIS Alzheimer’s MRI collection, resulting in a relatively narrow disease scope and imaging modality coverage, which limits generalization and transferability to broader clinical contexts.
2. The annotation process required approximately 700 hours of work by three medical experts, posing scalability challenges for expanding to diverse disease types or multi-center datasets in the future.
3. The methodological contribution is limited, leaning more toward

1. The major strength of this paper is the release of a clinically validated dataset, which ensures that all annotations are verified by medical experts.
2. While the technical contribution is somewhat limited, the efforts to design the four-stage reasoning framework, construct the dataset, and make it publicly available are highly valuable.

1. The dataset is limited to MRI scans for dementia.
2. The dataset does not consider the volumetric (3D) characteristics of the original MRI scans.

1. High-quality expert annotation with 700 hours of 3-doctor consensus using standardized clinical scales (Scheltens/Pasquier/Koedam). The 100% inter-annotator agreement demonstrates rigorous quality control.
2. Evaluation across medical VLMs (ExGra-Med, LLaVA-Med) and general VLMs (Qwen2.5-VL, InternVL2.5) with informative ablations.
3. Clear empirical gains: 8-15% over base models and 4-5% over GPT-4 synthetic CoT. Multilingual support (16 languages) enhances accessibility, though the pract

1. The paper addresses a single disease (Alzheimer's), single task (3-class severity grading), and a single modality (brain MRI), yet claims to establish principles "for medicine" broadly. In addition, the task is not differential diagnosis (AD vs. vascular dementia vs. Lewy body dementia vs. normal aging) but merely grading pre-diagnosed dementia patients into Non/Mild/Moderate severity levels. This is fundamentally a simpler task that bypasses the challenging diagnostic reasoning physicians ac
