InViC: Intent-aware Visual Cues for Medical Visual Question Answering

Zhisong Wang; Ziyang Chen; Zanting Ye; Hongze Zhu; Yefeng Zheng; Yong Xia

arXiv:2603.16372 · cs.CV · March 18, 2026

TL;DR

This paper introduces InViC, a lightweight framework that enhances medical VQA by explicitly guiding large language models to focus on visual evidence through intent-aware cue tokens, improving reliability and accuracy.

Contribution

InViC combines cue token extraction with two-stage fine-tuning to incorporate visual evidence into LLMs for medical VQA and reduce shortcut answering.

Findings

01. InViC improves accuracy on three Med-VQA benchmarks.

02. The cue-bottleneck training enhances visual evidence utilization.

03. InViC outperforms zero-shot and standard fine-tuning methods.

Abstract

Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual…
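
The abstract describes the CTE module only at a high level: dense visual tokens are distilled into K question-conditioned cue tokens. The sketch below is one way to picture that step, not the authors' implementation. It assumes a single-head cross-attention pooling in which K query vectors are shifted by a question embedding and then attend over the visual tokens. The function name `extract_cue_tokens`, the additive conditioning, and all dimensions are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_cue_tokens(visual_tokens, question_vec, cue_queries):
    """Pool N dense visual tokens into K question-conditioned cue tokens.

    Hypothetical stand-in for a cue extraction step: each query is shifted
    by the question vector (an assumed form of conditioning), then attends
    over all visual tokens with scaled dot-product attention.
    """
    dim = len(question_vec)
    scale = 1.0 / math.sqrt(dim)
    cues, attn_maps = [], []
    for query in cue_queries:
        # Condition the query on the question (assumption: simple addition).
        conditioned = [q + w for q, w in zip(query, question_vec)]
        # One attention score per visual token.
        scores = [
            scale * sum(c * t for c, t in zip(conditioned, tok))
            for tok in visual_tokens
        ]
        attn = softmax(scores)
        # The cue token is the attention-weighted average of visual tokens.
        cue = [
            sum(a * tok[d] for a, tok in zip(attn, visual_tokens))
            for d in range(dim)
        ]
        cues.append(cue)
        attn_maps.append(attn)
    return cues, attn_maps


# Toy inputs: N visual tokens of width D, pooled into K cue tokens.
rng = random.Random(0)
N, D, K = 16, 8, 4
visual_tokens = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
question_vec = [rng.gauss(0, 1) for _ in range(D)]
cue_queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]

cues, attn_maps = extract_cue_tokens(visual_tokens, question_vec, cue_queries)
print(len(cues), len(cues[0]))
```

In this toy version the K cue tokens are the only visual summary passed downstream, which is the sense in which they act as a compact intermediary. In a trained model the queries and any projection layers would be learned rather than sampled at random.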

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling