Improving Compositional Attribute Binding in Text-to-Image Generative Models via Enhanced Text Embeddings
Arman Zarei, Keivan Rezaei, Samyadeep Basu, Mehrdad Saberi, Mazda Moayeri, Priyatham Kattakinda, Soheil Feizi

TL;DR
This paper improves the accuracy of attribute-object composition in text-to-image models by fine-tuning a linear projection on CLIP embeddings, leading to better scene fidelity without affecting image quality scores.
Contribution
It introduces a simple, efficient method to enhance compositional attribute binding in diffusion models by optimizing CLIP text embeddings, addressing a key failure mode.
Findings
Enhanced compositional scene accuracy without increasing FID scores.
Fine-tuning a linear projection improves attribute-object binding.
Method requires only a small set of training pairs.
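The idea of fitting a linear projection on top of frozen text embeddings can be sketched in a few lines. The following is a toy illustration only, not the authors' implementation: it uses NumPy arrays in place of real CLIP embeddings, and a simple mean-squared-error target stands in for whatever training objective the paper actually uses. The function names (`init_projection`, `project`, `fit_projection`, `mse_loss`), the identity initialisation, and the hyperparameters are all assumptions made for this example.

```python
import numpy as np


def init_projection(dim):
    """Return an identity weight and zero bias (an assumed initialisation),
    so the projection starts out leaving the embeddings unchanged."""
    return np.eye(dim), np.zeros(dim)


def project(embeddings, weight, bias):
    """Apply the linear map to embeddings of shape (..., dim)."""
    return embeddings @ weight.T + bias


def mse_loss(pred, target):
    """Mean over rows of the squared Euclidean distance."""
    diff = pred - target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def fit_projection(inputs, targets, steps=200, lr=0.1):
    """Fit weight and bias by plain gradient descent on the toy MSE loss.

    inputs, targets: arrays of shape (num_prompts, num_tokens, dim).
    The MSE target is a stand-in objective chosen for this sketch.
    """
    dim = inputs.shape[-1]
    weight, bias = init_projection(dim)
    x = inputs.reshape(-1, dim)   # flatten prompts and tokens into rows
    y = targets.reshape(-1, dim)
    n = x.shape[0]
    for _ in range(steps):
        residual = project(x, weight, bias) - y
        grad_w = 2.0 * residual.T @ x / n      # gradient of the loss w.r.t. weight
        grad_b = 2.0 * residual.mean(axis=0)   # gradient of the loss w.r.t. bias
        weight -= lr * grad_w
        bias -= lr * grad_b
    return weight, bias
```

Only the weight and bias are updated here; the input embeddings are treated as fixed, which mirrors the listing's description of tuning a projection rather than the encoder itself. In a real setting the inputs would come from the CLIP text encoder and the loss would be computed through the image generator, both of which are outside the scope of this sketch.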
Abstract
Text-to-image diffusion-based generative models have the stunning ability to generate photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate compositional attribute binding failures, where the model fails to correctly associate descriptive attributes (such as color, shape, or texture) with the corresponding objects in the generated images, and highlight that imperfect text conditioning with CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly…
Taxonomy
Topics: Semantic Web and Ontologies · Topic Modeling · Image Retrieval and Classification Techniques
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
