Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models!
Arash Marioriyad, Mohammadali Banayeeanzade, Reza Abbasi, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR
This paper identifies excessive overlap between the cross-attention maps of different entities as the cause of the entity missing problem in text-to-image diffusion models, and proposes training-free loss functions that reduce this overlap, significantly improving compositional accuracy.
Contribution
It introduces four novel, training-free loss functions to regulate attention overlap, effectively mitigating the entity missing problem in diffusion models.
Findings
Reduced attention overlap improves entity depiction accuracy.
Proposed methods outperform previous approaches in benchmarks.
Human evaluation scores increased by 9%.
Abstract
Text-to-image diffusion models, such as Stable Diffusion and DALL-E, are capable of generating high-quality, diverse, and realistic images from textual prompts. However, they sometimes struggle to accurately depict specific entities described in prompts, a limitation known as the entity missing problem in compositional generation. While prior studies suggested that adjusting cross-attention maps during the denoising process could alleviate this problem, they did not systematically investigate which objective functions could best address it. This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics: (1) insufficient attention intensity for certain entities, (2) overly broad attention spread, and (3) excessive overlap between attention maps of different entities. We found that reducing overlap in attention maps between entities can…
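To make the three candidate causes concrete, here is a minimal sketch of one diagnostic for each, computed on toy cross-attention maps. The function names (`intensity`, `spread`, `soft_iou`) and the toy maps are hypothetical and chosen for illustration only. Soft IoU is one common way to measure overlap, and nothing here is claimed to be the exact objective used in the paper.

```python
from typing import List

# Assumption: each entity token has one 2D cross-attention map with values >= 0.
Map = List[List[float]]


def intensity(attn: Map) -> float:
    """Peak attention value: a proxy for cause (1), attention intensity."""
    return max(max(row) for row in attn)


def spread(attn: Map) -> float:
    """Attention-weighted spatial variance around the center of mass:
    a proxy for cause (2), attention spread."""
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    # Center of mass along rows (cy) and columns (cx).
    cy = sum(y * v for y, row in enumerate(attn) for v in row) / total
    cx = sum(x * v for row in attn for x, v in enumerate(row)) / total
    return sum(
        v * ((y - cy) ** 2 + (x - cx) ** 2)
        for y, row in enumerate(attn)
        for x, v in enumerate(row)
    ) / total


def soft_iou(a: Map, b: Map) -> float:
    """Soft intersection-over-union, sum(min) / sum(max):
    a proxy for cause (3), overlap between two entities' maps."""
    inter = sum(min(u, v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))
    union = sum(max(u, v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))
    return inter / union if union > 0 else 0.0


# Toy 2x2 maps for two entities in a hypothetical prompt "a cat and a dog".
cat = [[0.9, 0.1], [0.0, 0.0]]
dog_overlapping = [[0.8, 0.2], [0.0, 0.0]]  # attends to the same region as cat
dog_separate = [[0.0, 0.0], [0.1, 0.9]]     # attends to a different region

print(soft_iou(cat, dog_overlapping))  # high overlap
print(soft_iou(cat, dog_separate))     # no overlap
```

In a training-free setting, a differentiable version of such an overlap score could serve as a loss whose gradient nudges the latent during denoising. That step needs an autodiff framework and access to the model's attention layers, so it is left out of this sketch.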
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Mathematics, Computing, and Information Processing · Topic Modeling
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Diffusion
