Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

Yuki Endo

arXiv:2308.06027 · cs.CV · October 31, 2023

TL;DR

This paper introduces a training-free method for spatially controlling text-to-image diffusion models by manipulating attention maps based on semantic masks, improving alignment with user-defined regions.

Contribution

We propose masked-attention guidance, a novel approach that enhances spatial control in diffusion-based image synthesis without additional training.

Findings

01 Achieves more accurate spatial control than baseline methods.

02 Effectively integrates with pre-trained diffusion models like Stable Diffusion.

03 Improves alignment of generated images with semantic masks.

Abstract

Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone has high spatial ambiguity and limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that the cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach of directly swapping the cross-attention maps with constant maps computed from the semantic regions. Some prior works also allow training-free…
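The "simple approach" mentioned in the abstract, swapping cross-attention maps for constant maps computed from the semantic regions, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function name `constant_attention_maps`, the `token_regions` mapping, the per-pixel normalization, and the uniform fallback for pixels outside every region are all assumptions made here for illustration.

```python
import numpy as np

def constant_attention_maps(masks, token_regions, num_tokens):
    """Build constant cross-attention maps from semantic masks (illustrative only).

    masks:         (R, H, W) binary array, one mask per semantic region.
    token_regions: dict mapping a prompt token index to a region index
                   (hypothetical input; how tokens are paired with regions
                   is an assumption of this sketch).
    num_tokens:    number of tokens in the prompt.

    Returns an (H*W, num_tokens) array in which each pixel row sums to 1.
    """
    num_regions, h, w = masks.shape
    flat = masks.reshape(num_regions, h * w).astype(np.float64)  # (R, pixels)

    attn = np.zeros((h * w, num_tokens))
    for token, region in token_regions.items():
        # A token attends only to pixels inside its region.
        attn[:, token] = flat[region]

    row_sum = attn.sum(axis=1, keepdims=True)
    uncovered = row_sum[:, 0] == 0
    # Assumption: pixels covered by no region attend uniformly to all tokens.
    attn[uncovered] = 1.0 / num_tokens
    row_sum[uncovered] = 1.0
    return attn / row_sum
```

In a real pipeline such a map would have to be resized to each cross-attention layer's resolution and substituted for (or used to guide) the model's own attention; those steps depend on the diffusion model and are not shown here.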

Code & Models

Repositories

endo-yuki-t/mag (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Diffusion