InViC: Intent-aware Visual Cues for Medical Visual Question Answering
Zhisong Wang, Ziyang Chen, Zanting Ye, Hongze Zhu, Yefeng Zheng, Yong Xia

TL;DR
This paper introduces InViC, a lightweight framework that enhances medical VQA by explicitly guiding large language models to focus on visual evidence through intent-aware cue tokens, improving reliability and accuracy.
Contribution
InViC presents a cue extraction module and a two-stage fine-tuning approach that incorporate visual evidence into LLMs for medical VQA, reducing shortcut answering.
Findings
InViC improves accuracy on three Med-VQA benchmarks.
The cue-bottleneck training enhances visual evidence utilization.
InViC outperforms zero-shot and standard fine-tuning methods.
Abstract
Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual…
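As a rough illustration of what "distilling dense visual tokens into K question-conditioned cue tokens" could look like, here is a minimal sketch in plain Python. The abstract does not specify how CTE is implemented, so everything here is an assumption: the function name `extract_cue_tokens`, the K query vectors, conditioning on the question by simple addition, and single-head dot-product attention are all hypothetical choices, and the random vectors stand in for real encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_cue_tokens(visual_tokens, question_vec, cue_queries):
    """Pool N dense visual tokens into K cue tokens (hypothetical sketch).

    Each of the K queries is shifted by the question embedding (an assumed
    form of question conditioning), then attends over all visual tokens.
    The resulting cue token is the attention-weighted average of them.
    """
    dim = len(question_vec)
    scale = 1.0 / math.sqrt(dim)
    cues = []
    for query in cue_queries:
        conditioned = [q + t for q, t in zip(query, question_vec)]
        scores = [
            scale * sum(c * v for c, v in zip(conditioned, token))
            for token in visual_tokens
        ]
        weights = softmax(scores)
        cue = [
            sum(w * token[d] for w, token in zip(weights, visual_tokens))
            for d in range(dim)
        ]
        cues.append(cue)
    return cues


# Toy sizes: N dense visual tokens, K cue tokens, embedding width D.
N, K, D = 16, 4, 8
rng = random.Random(0)
visual_tokens = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
question_vec = [rng.gauss(0, 1) for _ in range(D)]
cue_queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]

cue_tokens = extract_cue_tokens(visual_tokens, question_vec, cue_queries)
print(len(cue_tokens), len(cue_tokens[0]))  # 4 8
```

In a setup like this, only the K cue tokens (rather than all N visual tokens) would be passed on to the LLM decoder, which is one way to read the "compact set" described in the abstract.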
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
