From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Changming Xiao; Qi Yang; Feng Zhou; Changshui Zhang

arXiv:2309.04109·cs.CV·October 2, 2024

TL;DR

This paper presents a novel method to localize image entities by leveraging attention mechanisms in text-to-image diffusion models, enabling semantic grounding without re-training and demonstrating superior performance in weakly-supervised segmentation tasks.

Contribution

The work introduces a simple, effective approach to extract word-pixel correlations from diffusion models' attention, applicable to semantic segmentation and personalized image segmentation tasks.
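
As a rough illustration of what a word-pixel correlation from attention can look like, the sketch below computes softmax cross-attention between image-position queries and text-token keys, then thresholds one token's map into a binary mask. It is a toy example under stated assumptions, not the paper's procedure: the function names, the random inputs, the min-max normalization, and the 0.5 threshold are all hypothetical, and a real pipeline would read the queries and keys from the cross-attention layers of a pretrained denoising network.

```python
import numpy as np

def cross_attention_maps(pixel_queries, token_keys):
    """Softmax attention of each pixel over the text tokens.

    pixel_queries: (H*W, d) array, one query per spatial position.
    token_keys:    (T, d) array, one key per text token.
    Returns an (H*W, T) array whose rows sum to 1.
    """
    d = pixel_queries.shape[-1]
    logits = pixel_queries @ token_keys.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def token_mask(attn, token_index, height, width, threshold=0.5):
    """Min-max normalize one token's attention map and threshold it.

    The normalization and threshold are illustrative assumptions.
    """
    heat = attn[:, token_index].reshape(height, width)
    span = heat.max() - heat.min()
    heat = (heat - heat.min()) / (span + 1e-8)
    return heat >= threshold

# Toy inputs standing in for features from a denoising network (assumption).
rng = np.random.default_rng(0)
H, W, T, d = 8, 8, 4, 16
queries = rng.normal(size=(H * W, d))
keys = rng.normal(size=(T, d))

attn = cross_attention_maps(queries, keys)
mask = token_mask(attn, token_index=2, height=H, width=W)
```

Here `attn[:, k]` reshaped to `(H, W)` plays the role of a heat map for the k-th token, and `mask` is a boolean array marking the positions that attend most strongly to that token.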

Findings

01 Outperforms prior methods on Pascal VOC 2012 and MS COCO 2014 under the weakly-supervised semantic segmentation setting.

02 Shows that the word-pixel correlations generalize to customized generation methods.

03 Introduces a new task and dataset for personalized referring image segmentation.

Abstract

Diffusion models have recently revolutionized the field of text-to-image generation. Their unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another perspective, these generative models carry clues about the precise correlation between words and pixels. In this work, a simple but effective method is proposed to utilize the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training or inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting, and our method achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is found to be generalizable for the learned text embedding of…

Code & Models

Repositories

Big-Brother-Pikachu/Text2Mask (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Diffusion · Latent Diffusion Model