Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts

Yifan Gao; Tao Zhou; Yi Zhou; Ke Zou; Yizhe Zhang; Huazhu Fu

arXiv:2604.01915·cs.CV·April 3, 2026

TL;DR

This paper introduces KnowMVG, a framework that enhances spatial precision in medical visual grounding by integrating medical knowledge and a global-local attention mechanism, improving localization accuracy.

Contribution

It proposes a knowledge-guided prompting strategy and a global-local attention module to explicitly improve spatial awareness in vision-language models for medical image grounding.

Findings

01

Improves AP50 by 3.0% over the previous state of the art.

02

Improves mIoU by 2.6% over prior methods.

03

Validates effectiveness through extensive experiments and ablations.
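
For readers unfamiliar with the metrics named in the findings, the following is a minimal sketch of how box-level IoU, mIoU, and an IoU >= 0.5 criterion are commonly computed. It is not the paper's evaluation code: the box format (x1, y1, x2, y2), the function names, and the toy boxes are all assumptions made for illustration. Note also that a full AP50 calculation ranks predictions by confidence and integrates a precision-recall curve; the `hit_rate_at` helper below only shows the IoU >= 0.5 matching rule that AP50 is built on.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """mIoU: average IoU over paired predicted and ground-truth boxes."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


def hit_rate_at(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of pairs whose IoU meets the threshold (hypothetical helper)."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    return sum(iou >= thr for iou in ious) / len(ious)


# Toy example: one exact match and one box shifted by half its width.
preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
gts = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0)]
print(mean_iou(preds, gts))     # average of 1.0 and 1/3
print(hit_rate_at(preds, gts))  # only the first pair reaches IoU >= 0.5
```

In this toy case the shifted box overlaps half of the ground truth yet scores an IoU of only 1/3, which is why a threshold of 0.5 is a fairly strict localization criterion.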

Abstract

Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely because they rely solely on latent embeddings and have no explicit localization priors. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local…
