Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

TL;DR
This paper introduces a training-free method to enhance the compositional accuracy of text-to-image diffusion models by manipulating cross-attention layers based on linguistic structures, improving object attribute binding and layout coherence.
Contribution
It proposes a novel, training-free approach to improve compositional capabilities of diffusion models through structured cross-attention manipulation guided by linguistic insights.
Findings
Achieves 5-8% improvement in user preference studies.
Enhances attribute binding and object layout accuracy.
Requires no additional training data.
Abstract
Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribution-binding and compositional capabilities are still considered major challenging issues, especially when involving multiple objects. In this work, we improve the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures with the diffusion guidance process based on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers have strong semantic meanings associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating…
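The abstract is cut off before it states exactly what is manipulated, so the following is only an illustrative sketch of one way to read the idea, not the authors' implementation. It assumes that the attention map (layout) is computed once from the keys of the full prompt, and that content comes from several value sequences (for example, one per noun phrase) whose attended outputs are averaged. All function and variable names here are hypothetical, and the toy numbers carry no meaning.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # a is n x k, b is k x m; returns n x m.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def attention_map(queries, keys):
    # Scaled dot-product attention weights: one row per image position,
    # one column per text token.
    d = len(keys[0])
    scores = matmul(queries, transpose(keys))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def structured_cross_attention(queries, keys, value_sets):
    # ASSUMPTION: the layout (attention map) comes from the full-prompt keys,
    # and is reused for every value sequence; the outputs are then averaged.
    attn = attention_map(queries, keys)
    outputs = [matmul(attn, values) for values in value_sets]
    n = len(outputs)
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(out[i][j] for out in outputs) / n for j in range(cols)]
            for i in range(rows)]

# Toy example: 2 image positions, 3 text tokens, embedding size 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values_full = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # hypothetical full-prompt values
values_phrase = [[0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical phrase-level values
out = structured_cross_attention(queries, keys, [values_full, values_phrase])
```

With a single value sequence the function reduces to ordinary cross-attention, which is why a scheme of this kind needs no retraining: only the inputs to existing layers change.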
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Diffusion
