Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models
Felix Nützel, Mischa Dombrowski, Bernhard Kainz

TL;DR
This paper demonstrates that generative text-to-image diffusion models, especially when fine-tuned with domain-specific language models and enhanced with a novel post-processing technique, significantly outperform discriminative methods in medical phrase grounding tasks.
Contribution
It introduces a new generative approach for phrase grounding in medical imaging, leveraging cross-attention maps and a novel post-processing method called Bimodal Bias Merging (BBM).
Findings
Generative diffusion models outperform discriminative methods in zero-shot phrase grounding.
Fine-tuning with domain-specific language models like CXR-BERT greatly improves performance.
The proposed BBM technique further refines localization accuracy.
Abstract
Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal…
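The abstract reports results as mIoU (mean intersection over union). As a rough illustration of that metric only, the sketch below turns a grounding heatmap (for example, an aggregated cross-attention map) into a binary mask and scores it against a ground-truth region mask. The min-max normalization, the 0.5 threshold, and the function names `binarize`, `iou`, and `mean_iou` are assumptions made for this example; the paper's actual evaluation protocol is not described on this page.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[float]]


def binarize(heatmap: Grid, threshold: float = 0.5) -> List[List[bool]]:
    """Min-max normalize a heatmap to [0, 1], then threshold it into a mask.

    The normalization and the default threshold are illustrative assumptions.
    """
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[(v - lo) / scale >= threshold for v in row] for row in heatmap]


def iou(pred: Sequence[Sequence[object]], target: Sequence[Sequence[object]]) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += bool(p) and bool(t)
            union += bool(p) or bool(t)
    return inter / union if union else 0.0


def mean_iou(heatmaps: Sequence[Grid], targets: Sequence[Grid],
             threshold: float = 0.5) -> float:
    """Average the per-sample IoU over a set of (heatmap, target mask) pairs."""
    scores = [iou(binarize(h, threshold), t) for h, t in zip(heatmaps, targets)]
    return sum(scores) / len(scores)


# Toy 2x2 example: the first heatmap matches its target exactly,
# the second covers only half of its target region.
heatmaps = [[[0.0, 0.2], [0.8, 1.0]], [[1.0, 0.0], [0.0, 0.0]]]
targets = [[[0, 0], [1, 1]], [[1, 1], [0, 0]]]
score = mean_iou(heatmaps, targets)
```

In this toy case the per-sample IoUs are 1.0 and 0.5. In practice the heatmap would first be resized to the image resolution, and the target mask would come from annotated bounding boxes.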
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
