From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models
Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang

TL;DR
This paper presents a novel method to localize image entities by leveraging attention mechanisms in text-to-image diffusion models, enabling semantic grounding without re-training and demonstrating superior performance in weakly-supervised segmentation tasks.
Contribution
The work introduces a simple, effective approach to extract word-pixel correlations from diffusion models' attention, applicable to semantic segmentation and personalized image segmentation tasks.
Findings
Outperforms prior methods on Pascal VOC 2012 and MS COCO 2014 under the weakly-supervised semantic segmentation setting.
Demonstrates generalizability of word-pixel correlations to customized generation methods.
Introduces a new task and dataset for personalized referring image segmentation.
Abstract
Diffusion models have recently revolutionized the field of text-to-image generation. Their unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another perspective, these generative models imply clues about the precise correlation between words and pixels. In this work, a simple but effective method is proposed to utilize the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training or inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting, and our method achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is found to be generalizable for the learned text embedding of…
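The abstract does not spell out how the attention is read off, so the following is only a minimal sketch of the general idea, assuming standard scaled dot-product cross-attention between image-side queries and text-token keys. The function names, the min-max normalization, and the fixed threshold are hypothetical choices for illustration, not the authors' procedure, and random features stand in for real denoiser activations.

```python
import numpy as np


def cross_attention_map(pixel_queries, token_keys):
    """Softmax-normalized attention of each pixel over the text tokens.

    pixel_queries: (H*W, d) array, one query vector per spatial location.
    token_keys:    (T, d) array, one key vector per text token.
    Returns an (H*W, T) array whose rows sum to 1.
    """
    d = pixel_queries.shape[-1]
    logits = pixel_queries @ token_keys.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def token_mask(attn, token_index, height, width, threshold=0.5):
    """Turn one token's attention column into a binary mask.

    Min-max normalization and a fixed threshold are illustrative
    assumptions, not taken from the paper.
    """
    heat = attn[:, token_index].reshape(height, width)
    span = heat.max() - heat.min()
    heat = (heat - heat.min()) / (span + 1e-8)
    return heat >= threshold


# Toy example: random features stand in for real denoiser activations.
rng = np.random.default_rng(0)
H, W, T, D = 4, 4, 3, 8
queries = rng.normal(size=(H * W, D))
keys = rng.normal(size=(T, D))
attn = cross_attention_map(queries, keys)
mask = token_mask(attn, token_index=1, height=H, width=W)
```

In a real text-to-image diffusion model the queries and keys would come from the cross-attention layers of the denoising network, and maps from several layers and timesteps would presumably need to be aggregated before thresholding; that aggregation is left out here.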
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Diffusion · Latent Diffusion Model
