Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

Jeeyung Kim; Erfan Esmaeili; Qiang Qiu

arXiv:2411.15236 · cs.CV · November 26, 2024

Open Access

TL;DR

This paper identifies limitations in current text-to-image models related to syntactic and attribute binding issues and proposes a test-time optimization method to improve semantic alignment by leveraging text attention maps.

Contribution

It introduces a novel approach that transfers syntactic relations from text attention maps to the cross-attention module via test-time optimization, enhancing image-text alignment.
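As a rough illustration of what transferring relations from text attention to cross-attention could look like, the sketch below defines a loss that compares the pairwise similarity of per-token cross-attention maps against a token-to-token text attention matrix. This is a minimal sketch under stated assumptions, not the paper's actual objective: the function names `cosine_similarity_matrix` and `alignment_loss`, the squared-error form, the symmetric toy `text_attn` matrix, and all numeric values are hypothetical. In a test-time optimization setting, a loss of this kind would be minimized during sampling; that step is omitted here.

```python
import numpy as np

def cosine_similarity_matrix(maps):
    """Pairwise cosine similarity between per-token maps of shape (n_tokens, H, W)."""
    flat = maps.reshape(maps.shape[0], -1).astype(float)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    flat = flat / np.clip(norms, 1e-12, None)  # avoid division by zero
    return flat @ flat.T

def alignment_loss(cross_maps, text_attn):
    """Mean squared gap between cross-attention similarity and text attention.

    Only off-diagonal entries are compared, since a token's similarity
    with itself carries no relational information.
    """
    sim = cosine_similarity_matrix(cross_maps)
    off_diag = ~np.eye(sim.shape[0], dtype=bool)
    return float(np.mean((sim[off_diag] - text_attn[off_diag]) ** 2))

# Hypothetical toy setup for the tokens ["black", "car", "clock"].
region_a = np.zeros((4, 4)); region_a[:2, :2] = 1.0   # top-left patch
region_b = np.zeros((4, 4)); region_b[2:, 2:] = 1.0   # bottom-right patch

# Assumed target: "black" and "car" are strongly related, "clock" is not.
text_attn = np.array([[1.0, 0.9, 0.0],
                      [0.9, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])

good_maps = np.stack([region_a, region_a, region_b])  # black and car overlap
bad_maps = np.stack([region_a, region_b, region_b])   # car and clock overlap

good_loss = alignment_loss(good_maps, text_attn)
bad_loss = alignment_loss(bad_maps, text_attn)
print(good_loss, bad_loss)
```

In this toy case the layout where "black" and "car" share a region scores a lower loss than the layout where "car" and "clock" collide, which is the direction an optimizer would push.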

Findings

01 Improved semantic alignment in generated images.

02 Better attribute and object binding accuracy.

03 Enhanced model robustness across diverse prompts.

Abstract

In text-to-image diffusion models, the cross-attention map of each text token indicates the image regions that token attends to. Comparing the maps of syntactically related tokens shows how well the generated image reflects the text prompt. For example, in the prompt "a black car and a white clock", the cross-attention maps for "black" and "car" should focus on overlapping regions to depict a black car, while those for "car" and "clock" should not overlap. Incorrect overlap between maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes two key observations about this issue in existing text-to-image models: (1) the similarity in text embeddings between different tokens -- used as conditioning inputs -- can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to…
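The overlap comparison described in the abstract can be made concrete with a small sketch. The abstract does not state which overlap measure is used, so cosine similarity between flattened maps is an assumption here, and the helper name `map_overlap` and the 4x4 toy maps are hypothetical values rather than outputs of any model.

```python
import numpy as np

def map_overlap(a, b):
    """Cosine similarity between two non-negative attention maps of equal shape."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy 4x4 cross-attention maps for the prompt "a black car and a white clock".
black = np.zeros((4, 4)); black[:2, :2] = 1.0               # top-left patch
car = np.zeros((4, 4)); car[:2, :2] = 1.0; car[0, 2] = 0.5  # same patch, slightly wider
clock = np.zeros((4, 4)); clock[2:, 2:] = 1.0               # bottom-right patch

black_car = map_overlap(black, car)    # related tokens: expected to be high
car_clock = map_overlap(car, clock)    # distinct objects: expected to be low
print(black_car, car_clock)
```

Here the "black" and "car" maps score close to 1, while "car" and "clock" score 0 because their regions are disjoint. A high score for the second pair would signal the kind of incorrect overlap the abstract associates with missing objects or wrong attribute binding.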

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: Attention Is All You Need · Concatenated Skip Connection · Softmax · Diffusion · Focus