S-Chain: Structured Visual Chain-of-Thought For Medicine

Khai Le-Duc; Duy M. H. Nguyen; Phuong T. H. Trinh; Tien-Phat Nguyen; Nghiem T. Diep; An Ngo; Tung Vu; Trinh Vuong; Anh-Tien Nguyen; Mau Nguyen; Van Trung Hoang; Khai-Nguyen Nguyen; Hy Nguyen; Chris Ngo; Anji Liu; Nhat Ho; Anne-Christin Hauschild; Khanh Xuan Nguyen; Thanh Nguyen-Tang; Pengtao Xie; Daniel Sonntag; James Zou; Mathias Niepert; Anh Totti Nguyen

arXiv:2510.22728·cs.LG·October 28, 2025

TL;DR

S-Chain introduces a large-scale, expert-annotated dataset with structured visual reasoning for medical VLMs, significantly enhancing interpretability, grounding, and robustness in medical visual question answering.

Contribution

The paper provides the first large-scale dataset with structured visual CoT annotations for medical images, enabling improved reasoning and grounding in medical vision-language models.

Findings

01

SV-CoT supervision improves interpretability and grounding fidelity.

02

Benchmarking shows state-of-the-art models benefit from SV-CoT training.

03

Proposed mechanisms enhance alignment between visual evidence and reasoning.
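
The findings above refer to grounding fidelity without naming a metric. As an illustration only, intersection-over-union (IoU) between a predicted and an annotated bounding box is one common way to score localization; whether the paper uses it is an assumption here, and the `Box` alias and `iou` function are hypothetical names.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against two zero-area boxes.
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
annotated = (1.0, 1.0, 3.0, 3.0)
print(iou(predicted, annotated))  # overlap area 1, union area 7
```

A score of 1.0 means the two boxes coincide and 0.0 means they do not overlap.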

Abstract

Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability,…
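
To make the idea of linking visual regions to reasoning steps concrete, here is a minimal sketch of how one SV-CoT record might be represented. The stage names follow the four-stage process quoted in the reviews on this page (localization → description → grading → diagnosis). All class names, field names, the per-step placement of boxes, and the placeholder values are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Stage order as described in the reviews; everything else is hypothetical.
STAGES = ("localization", "description", "grading", "diagnosis")


@dataclass
class ReasoningStep:
    stage: str
    text: str
    # Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
    boxes: List[Tuple[float, float, float, float]]


@dataclass
class SVCoTRecord:
    image_id: str
    language: str
    question: str
    steps: List[ReasoningStep]
    answer: str


def is_well_formed(record: SVCoTRecord) -> bool:
    """True if steps follow the stage order and every box has positive area."""
    if tuple(step.stage for step in record.steps) != STAGES:
        return False
    for step in record.steps:
        for x0, y0, x1, y1 in step.boxes:
            if not (x0 < x1 and y0 < y1):
                return False
    return True


# Placeholder record; none of these values come from the dataset.
example = SVCoTRecord(
    image_id="example-0001",
    language="en",
    question="<question text>",
    steps=[
        ReasoningStep(stage=name, text=f"<{name} rationale>",
                      boxes=[(10.0, 20.0, 50.0, 60.0)])
        for name in STAGES
    ],
    answer="<answer text>",
)
print(is_well_formed(example))
```

Keeping the boxes on each step, rather than on the image as a whole, is what would let a reader trace a textual rationale back to the region it cites.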

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 5

Strengths

1. A large-scale dataset
2. Expert-involved annotation pipeline

Weaknesses

1. The dataset mainly consists of Alzheimer’s disease (AD) MRI figures, introducing a significant bias that limits the generalization and undermines the broad claim of “Chain-of-Thought for Medicine.”
2. The proposed Chain-of-Thought (CoT) approach appears too rigid, resembling a predetermined analytical workflow rather than flexible, natural reasoning.
3. The results show Gemini 2.5 Flash performing much better than all other models, which seems unusual and raises concerns about the evaluation

Reviewer 02·Rating 4·Confidence 3

Strengths

1. S-Chain introduces the first expert-annotated Structured Visual Chain-of-Thought (SV-CoT) benchmark, covering 12k medical images across 16 languages, effectively filling the gap in evaluating visual–reasoning consistency within the medical domain.
2. The four-stage structured reasoning process (localization → description → grading → diagnosis) mirrors real clinical diagnostic logic, enabling models to generate traceable and interpretable reasoning paths while mitigating hallucinations and sem

Weaknesses

1. The dataset is primarily based on the OASIS Alzheimer’s MRI collection, resulting in a relatively narrow disease scope and imaging modality coverage, which limits generalization and transferability to broader clinical contexts.
2. The annotation process required approximately 700 hours of work by three medical experts, posing scalability challenges for expanding to diverse disease types or multi-center datasets in the future.
3. The methodological contribution is limited, leaning more toward

Reviewer 03·Rating 6·Confidence 5

Strengths

1. The major strength of this paper is the release of a clinically validated dataset, which ensures that all annotations are verified by medical experts.
2. While the technical contribution is somewhat limited, the efforts to design the four-stage reasoning framework, construct the dataset, and make it publicly available are highly valuable.

Weaknesses

1. The dataset is limited to MRI scans for dementia.
2. The dataset does not consider the volumetric (3D) characteristics of the original MRI scans.

Reviewer 04·Rating 2·Confidence 4

Strengths

1. High-quality expert annotation with 700 hours of 3-doctor consensus using standardized clinical scales (Scheltens/Pasquier/Koedam). The 100% inter-annotator agreement demonstrates rigorous quality control.
2. Evaluation across medical VLMs (ExGra-Med, LLaVA-Med) and general VLMs (Qwen2.5-VL, InternVL2.5) with informative ablations.
3. Clear empirical gains: 8-15% over base models and 4-5% over GPT-4 synthetic CoT. Multilingual support (16 languages) enhances accessibility, though the pract

Weaknesses

1. The paper addresses a single disease (Alzheimer's), single task (3-class severity grading), and a single modality (brain MRI), yet claims to establish principles "for medicine" broadly. In addition, the task is not differential diagnosis (AD vs. vascular dementia vs. Lewy body dementia vs. normal aging) but merely grading pre-diagnosed dementia patients into Non/Mild/Moderate severity levels. This is fundamentally a simpler task that bypasses the challenging diagnostic reasoning physicians ac

Code & Models

Models

leduckhai/S-Chain
model

Datasets

leduckhai/S-Chain
dataset·465 dl
