Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

Bingyan Liu; Chengyu Wang; Tingfeng Cao; Kui Jia; Jun Huang

arXiv:2403.03431·cs.CV·March 7, 2024·2 cites

Open Access

TL;DR

This paper investigates the roles of cross and self-attention mechanisms in Stable Diffusion for text-guided image editing, revealing their distinct functions and proposing a simplified, more effective editing method.

Contribution

It provides a detailed analysis of attention maps in Stable Diffusion, clarifies their semantic roles, and introduces a simplified, tuning-free image editing approach based on self-attention.

Findings

01

Cross-attention maps often contain object attribution information, which can lead to editing failures.

02

Self-attention maps preserve geometric and shape details during editing.

03

Proposed method outperforms existing approaches on multiple datasets.

Abstract

Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. Yet, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance to application developers. TIE modifies objects or object properties in images by manipulating feature components in attention layers during the generation process. However, little is known about what semantic meanings these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information that can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the…
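
To make "manipulating feature components in attention layers" concrete, the toy sketch below computes a self-attention map for one pass and reuses it in a second pass. It is an illustration only, not the authors' implementation: the names `self_attention`, `injected_map`, `source_feats`, and `target_feats` are hypothetical, the features and weights are random NumPy arrays standing in for real U-Net activations, and the choice to reuse the source map in the edit pass is an assumption about what such a manipulation might look like.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_map(q, k):
    # Scaled dot-product attention weights: softmax(Q K^T / sqrt(d)).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)


def self_attention(x, w_q, w_k, w_v, injected_map=None):
    # Queries, keys, and values all come from the same features x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Hypothetical hook: use a map recorded elsewhere instead of our own.
    attn = attention_map(q, k) if injected_map is None else injected_map
    return attn @ v, attn


rng = np.random.default_rng(0)
n_tokens, dim = 16, 8  # e.g. a flattened 4x4 latent grid (assumed sizes)
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

source_feats = rng.normal(size=(n_tokens, dim))  # stand-in for the source pass
target_feats = rng.normal(size=(n_tokens, dim))  # stand-in for the edit pass

# Record the self-attention map of the source pass.
_, source_map = self_attention(source_feats, w_q, w_k, w_v)

# Reuse it in the edit pass: values come from the target features,
# while the token-to-token layout comes from the source.
edited_out, used_map = self_attention(
    target_feats, w_q, w_k, w_v, injected_map=source_map
)
```

In this sketch the edit pass keeps its own values but inherits the source's token-to-token weighting, which is one simple way to picture how a self-attention map could carry layout information from one generation pass to another.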

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Image Retrieval and Classification Techniques

Methods: Diffusion