Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Yuan Wu; Zongxian Yang; Jiayu Qian; Songpan Gao; Guanxing Chen; Qiankun Li; Yu-An Huang; Zhi-An Huang

arXiv:2603.06665 · cs.CV · April 13, 2026

TL;DR

This paper investigates why chain-of-thought (CoT) prompting often underperforms direct answering on medical visual question answering, and proposes training-free, inference-time grounding interventions to improve model accuracy.

Contribution

The paper identifies a medical perception bottleneck and introduces two inference-time interventions, perception anchoring (region-of-interest cues) and description grounding (textual guidance), to strengthen visual grounding in VLMs on medical tasks.

Findings

01 Grounding interventions improve accuracy across benchmarks.

02 Interventions mitigate chain-of-thought degradation.

03 In some cases, interventions reverse the chain-of-thought versus direct answering trend.

Abstract

Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and…
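To make the two interventions concrete, the following is a minimal sketch of how direct-answer and CoT prompts might be assembled with optional grounding cues at inference time. It is not the authors' implementation: every name (`GroundingCues`, `build_prompt`, the suffix strings) is hypothetical, the bounding-box format is an assumption, and the ROI cue is expressed as text here even though the paper may supply it differently (for example, as a visual marker on the image).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# All names and prompt wordings below are hypothetical, for illustration only;
# none are taken from the paper or its repository.
DIRECT_SUFFIX = "Answer with the final answer only."
COT_SUFFIX = "Think step by step, then give the final answer."


@dataclass
class GroundingCues:
    # Region of interest as (x_min, y_min, x_max, y_max) in pixels (assumed format).
    roi: Optional[Tuple[int, int, int, int]] = None
    # Textual description of the relevant image content, assumed to be supplied externally.
    description: Optional[str] = None


def build_prompt(question: str, use_cot: bool,
                 cues: Optional[GroundingCues] = None) -> str:
    """Assemble a text prompt, optionally prefixed with grounding cues."""
    parts = []
    if cues is not None and cues.roi is not None:
        # Perception anchoring: point the model at a region of interest.
        x0, y0, x1, y1 = cues.roi
        parts.append(
            f"Focus on the image region with corners ({x0}, {y0}) and ({x1}, {y1})."
        )
    if cues is not None and cues.description:
        # Description grounding: prepend textual guidance about the image.
        parts.append(f"Image description: {cues.description}")
    parts.append(f"Question: {question}")
    # Choose between direct answering (DirA) and chain-of-thought (CoT).
    parts.append(COT_SUFFIX if use_cot else DIRECT_SUFFIX)
    return "\n".join(parts)
```

Under these assumptions, comparing the four combinations (DirA or CoT, with or without cues) on the same questions would show whether grounding narrows the gap between CoT and direct answering; the image itself would be passed to the model separately, through whatever interface that model provides.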

Code & Models

Repositories

TianYin123/Better_Eyes_Better_Thoughts (GitHub)
