Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models

Felix Nützel; Mischa Dombrowski; Bernhard Kainz

arXiv:2507.12236·cs.CV·July 17, 2025

TL;DR

This paper demonstrates that generative text-to-image diffusion models, especially when fine-tuned with domain-specific language models and enhanced with a novel post-processing technique, significantly outperform discriminative methods in medical phrase grounding tasks.

Contribution

It introduces a new generative approach for phrase grounding in medical imaging, leveraging cross-attention maps and a novel post-processing method called Bimodal Bias Merging (BBM).
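As a rough illustration of how cross-attention maps can be turned into a localization heatmap, the sketch below averages attention weights over denoising steps and heads, selects the tokens of the query phrase, and rescales the result to [0, 1]. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `phrase_heatmap`, the `(steps, heads, pixels, tokens)` array layout, the plain mean aggregation, and the min-max normalization are all hypothetical choices, and BBM post-processing is not modeled.

```python
import numpy as np

def phrase_heatmap(attn, token_idx):
    """Collapse cross-attention weights into one 2D heatmap for a phrase.

    attn: array of shape (steps, heads, pixels, tokens); this layout is an
          assumption made for the sketch, not taken from the paper.
    token_idx: positions of the phrase tokens in the text sequence.
    """
    attn = np.asarray(attn, dtype=np.float64)
    if attn.ndim != 4:
        raise ValueError("expected shape (steps, heads, pixels, tokens)")
    pixels = attn.shape[2]
    side = int(round(np.sqrt(pixels)))
    if side * side != pixels:
        raise ValueError("expected a square spatial grid")

    # Average over denoising steps and attention heads (assumed aggregation).
    per_token = attn.mean(axis=(0, 1))               # (pixels, tokens)
    # Average over the tokens that make up the phrase.
    heat = per_token[:, list(token_idx)].mean(axis=1)
    heat = heat.reshape(side, side)

    # Min-max rescale to [0, 1]; a constant map becomes all zeros.
    lo, hi = heat.min(), heat.max()
    if hi > lo:
        return (heat - lo) / (hi - lo)
    return np.zeros_like(heat)
```

In practice the heatmap would be upsampled to the image resolution before being compared with an annotation; that step is omitted here.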

Findings

01 Generative diffusion models outperform discriminative methods in zero-shot phrase grounding.

02 Fine-tuning with domain-specific language models like CXR-BERT greatly improves performance.

03 The proposed BBM technique further refines localization accuracy.

Abstract

Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal…
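The mIoU metric mentioned in the abstract can be sketched as follows: binarize each predicted heatmap, compute intersection over union against the ground-truth region mask, and average over samples. This is a minimal sketch under assumptions; the function names, the fixed threshold of 0.5, and the convention of scoring an empty union as 0 are illustrative choices and may differ from the paper's evaluation protocol.

```python
import numpy as np

def iou(heatmap, gt_mask, threshold=0.5):
    """Intersection over union between a thresholded heatmap and a binary mask.

    threshold is an assumed binarization cutoff, not a value from the paper.
    """
    pred = np.asarray(heatmap) >= threshold
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both prediction and ground truth are empty; scored as 0 by convention here.
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(heatmaps, gt_masks, threshold=0.5):
    """Average IoU over paired heatmaps and ground-truth masks."""
    scores = [iou(h, g, threshold) for h, g in zip(heatmaps, gt_masks)]
    return float(np.mean(scores)) if scores else 0.0
```

For example, a prediction covering 4 pixels that lies entirely inside an 8-pixel ground-truth region has an IoU of 4 / 8 = 0.5.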

Code & Models

Models

Hugging Face: FelixNuetzel/cxr_bert_ldm (model, 2 downloads)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques