V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis

Yuan Wang; Jiaxiang Liu; Shujian Gao; Bin Feng; Zhihang Tang; Xiaotang Gai; Jian Wu; Zuozhu Liu

arXiv:2506.19610 · cs.CE · June 30, 2025

TL;DR

V2T-CoT introduces a novel multimodal approach that localizes disease-specific regions in biomedical images and generates explainable reasoning paths, significantly improving accuracy and interpretability in medical visual question answering.

Contribution

It automates disease region localization and integrates this into a reasoning framework, enhancing explainability and performance in Med-VQA tasks.

Findings

1. Achieves state-of-the-art results on four Med-VQA benchmarks
2. Improves interpretability through visual grounding and textual rationale
3. Enhances diagnostic accuracy with region-specific attention mechanisms

Abstract

Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing disease-specific regions crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are crucial for clinical decision-making. To address these challenges, we propose From Vision to Text Chain-of-Thought (V2T-CoT), a novel approach that automates the localization of preference areas within biomedical images and incorporates this localization into region-level pixel attention as knowledge for Vision CoT. By fine-tuning the vision language model on the constructed R-Med 39K dataset, V2T-CoT provides definitive medical reasoning paths. V2T-CoT integrates visual grounding with textual rationale generation to…
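
To make the idea of "region-level pixel attention" concrete, here is a minimal sketch of one way a localized region could be turned into per-pixel weights. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `(x0, y0, x1, y1)` box convention, and the `inside`/`outside` weights are all hypothetical, and the abstract does not specify how V2T-CoT actually computes or applies its attention.

```python
from typing import List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def region_attention_mask(height: int, width: int, box: Box,
                          inside: float = 1.0,
                          outside: float = 0.2) -> List[List[float]]:
    """Return per-pixel weights: `inside` within the box, `outside` elsewhere.

    The weight values are illustrative assumptions, not values from the paper.
    """
    x0, y0, x1, y1 = box
    # Clamp the box to the image bounds so oversized boxes are handled safely.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [inside if (y0 <= y < y1 and x0 <= x < x1) else outside
         for x in range(width)]
        for y in range(height)
    ]


def apply_mask(image: List[List[float]],
               mask: List[List[float]]) -> List[List[float]]:
    """Scale each pixel intensity by its attention weight."""
    return [
        [pixel * weight for pixel, weight in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    # A toy 3x4 grayscale image with uniform intensity.
    image = [[1.0] * 4 for _ in range(3)]
    mask = region_attention_mask(3, 4, box=(1, 0, 3, 2))
    for row in apply_mask(image, mask):
        print(row)
```

Pixels outside the box are down-weighted rather than zeroed, so global context stays visible while the localized region dominates. This is a design choice of the sketch only.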

Code & Models

Repositories

venn2336/v2t_cot
none · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning