Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang, Huazhu Fu

TL;DR
This paper introduces KnowMVG, a framework that enhances spatial precision in medical visual grounding by integrating medical knowledge and a global-local attention mechanism, improving localization accuracy.
Contribution
It proposes a knowledge-guided prompting strategy and a global-local attention module to explicitly improve spatial awareness in vision-language models for medical image grounding.
Findings
Achieves 3.0% AP50 improvement over state-of-the-art.
Achieves 2.6% mIoU improvement over prior methods.
Validates effectiveness through extensive experiments and ablations.
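The two reported metrics both rest on box overlap, which can be sketched briefly. The snippet below is a minimal illustration, not the authors' evaluation code: it assumes axis-aligned boxes given as (x1, y1, x2, y2), one predicted box per phrase, and it treats the threshold metric as the fraction of predictions whose IoU with the ground truth is at least 0.5. The paper's exact AP50 protocol may differ. The names `box_iou`, `mean_iou`, and `hit_rate_at` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal in length")
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def hit_rate_at(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of pairs with IoU >= thr (a simplified stand-in for AP50)."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal in length")
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) >= thr)
    return hits / len(preds)
```

Under these assumptions, a gain in the threshold metric means more phrases are localized with at least half overlap, while a gain in mIoU reflects tighter boxes on average.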
Abstract
Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely because they rely solely on latent embeddings and have no explicit localization priors. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local…
