Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Weixi Feng; Xuehai He; Tsu-Jui Fu; Varun Jampani; Arjun Akula; Pradyumna Narayana; Sugato Basu; Xin Eric Wang; William Yang Wang

arXiv:2212.05032·cs.CV·March 2, 2023·70 cites

TL;DR

This paper introduces a training-free method to enhance the compositional accuracy of text-to-image diffusion models by manipulating cross-attention layers based on linguistic structures, improving object attribute binding and layout coherence.

Contribution

It proposes a novel, training-free approach to improve compositional capabilities of diffusion models through structured cross-attention manipulation guided by linguistic insights.

Findings

01

Achieves 5-8% improvement in user preference studies.

02

Enhances attribute binding and object layout accuracy.

03

Requires no additional training data.

Abstract

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality and creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically targeting more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, building on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers have strong semantic meanings associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating…
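As a rough illustration of the observation about keys and values, the following is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not the paper's procedure: the abstract above is truncated, so the actual manipulation is not shown here. The function names and the toy vectors are hypothetical. The sketch only shows that the attention maps (which image location attends to which text token) depend on the queries and keys, while the values determine what content is written to each location.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: one vector per image location.
    keys, values: one vector per text token.
    Returns (attention_maps, outputs).
    """
    d = len(keys[0])
    maps = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        maps.append(softmax(scores))
    # Each output is the attention-weighted sum of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in maps
    ]
    return maps, outputs


# Hypothetical toy setup: 2 image locations, 2 text tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[4.0, 0.0], [0.0, 4.0]]
values_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values_b = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # different content for token 0

maps_a, out_a = cross_attention(queries, keys, values_a)
maps_b, out_b = cross_attention(queries, keys, values_b)

# Same queries and keys give the same attention maps (the "layout"),
# while different values change what is written at each location.
print(maps_a == maps_b)
print(out_a != out_b)
```

In this toy setting, swapping the value vectors leaves the attention maps unchanged and alters only the outputs, which is one way to read the claim that keys relate to layout and values to content.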

Code & Models

Repositories

weixi-feng/structured-diffusion-guidance
PyTorch · Official

Datasets

AIML-TUDA/MCC-250
dataset · 60 dl

Videos

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Diffusion