TL;DR
This paper investigates why chain-of-thought prompting often underperforms in medical vision-language tasks and proposes inference-time grounding interventions to improve model accuracy.
Contribution
It identifies a medical perception bottleneck and introduces two training-free, inference-time interventions, perception anchoring and description grounding, to strengthen visual grounding in clinical VLMs.
Findings
Grounding interventions improve accuracy across benchmarks.
Interventions mitigate chain-of-thought degradation.
In some cases, interventions reverse the chain-of-thought versus direct answering trend.
Abstract
Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and…
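As a rough illustration of what training-free, inference-time grounding could look like at the prompt level, the sketch below assembles a text prompt for either direct answering or CoT, optionally prefixed with a region-of-interest cue and a textual description. It is not the paper's implementation: the names `GroundingCues` and `build_prompt`, the bounding-box format, and all prompt wording are assumptions made for this example, and the abstract does not specify how the cues are actually delivered to the model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GroundingCues:
    """Hypothetical container for the two kinds of grounding cues."""
    # Assumed format: pixel box (x0, y0, x1, y1) for perception anchoring.
    roi: Optional[Tuple[int, int, int, int]] = None
    # Assumed format: free-text image description for description grounding.
    description: Optional[str] = None


def build_prompt(question: str, mode: str = "dira",
                 cues: Optional[GroundingCues] = None) -> str:
    """Build a text prompt for direct answering ("dira") or CoT ("cot").

    The prompt wording is illustrative only, not taken from the paper.
    """
    if mode not in ("dira", "cot"):
        raise ValueError(f"unknown mode: {mode!r}")

    parts = []
    if cues is not None and cues.roi is not None:
        x0, y0, x1, y1 = cues.roi
        # Perception anchoring: point the model at a region of interest.
        parts.append(
            f"Focus on the image region with corners ({x0}, {y0}) and ({x1}, {y1})."
        )
    if cues is not None and cues.description:
        # Description grounding: supply textual guidance about the image.
        parts.append(f"Image description: {cues.description}")

    parts.append(f"Question: {question}")
    if mode == "cot":
        parts.append("Think step by step, then give the final answer.")
    else:
        parts.append("Answer directly with the final answer only.")
    return "\n".join(parts)


if __name__ == "__main__":
    # Made-up example inputs, for illustration only.
    cues = GroundingCues(roi=(10, 20, 110, 220),
                         description="Frontal chest radiograph.")
    print(build_prompt("Is there a pleural effusion?", mode="cot", cues=cues))
```

Keeping the cues separate from the answering mode makes it easy to compare all four conditions (DirA or CoT, with or without grounding) on the same question, which mirrors the comparison the abstract describes.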
