Improving Compositional Attribute Binding in Text-to-Image Generative Models via Enhanced Text Embeddings

Arman Zarei; Keivan Rezaei; Samyadeep Basu; Mehrdad Saberi; Mazda Moayeri; Priyatham Kattakinda; Soheil Feizi

arXiv:2406.07844 · cs.CV · March 26, 2025

TL;DR

This paper improves attribute-object binding in text-to-image models by fine-tuning a linear projection on top of CLIP text embeddings, producing more faithful compositional scenes without increasing FID scores.

Contribution

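The paper introduces a simple, efficient method that improves compositional attribute binding in diffusion models by optimizing CLIP text embeddings. It targets the failure mode in which attributes such as color, shape, or texture are not associated with the correct object in the generated image.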

Findings

01 Enhanced compositional scene accuracy without increasing FID scores.

02 Fine-tuning a linear projection improves attribute-object binding.

03 The method requires only a small set of training pairs.
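
To make the idea of "fine-tuning a linear projection on CLIP embeddings" concrete, here is a minimal sketch in dependency-free Python. It is an illustration under stated assumptions, not the authors' implementation: the residual form e' = e + W e + b, the zero initialization, the toy dimensions, and the names `project`, `train_step`, and `run_demo` are all hypothetical. This summary does not spell out the training loss, so a stand-in mean-squared error toward made-up target embeddings is used purely to show that only the projection parameters are updated while the input embeddings stay fixed.

```python
import random


def project(tokens, W, b):
    """Residual linear projection applied per token: e' = e + W e + b."""
    d = len(b)
    return [
        [e[i] + sum(W[i][j] * e[j] for j in range(d)) + b[i] for i in range(d)]
        for e in tokens
    ]


def mse(pred, target):
    """Mean-squared error over all token coordinates (stand-in loss, an assumption)."""
    n = sum(len(row) for row in pred)
    return sum(
        (p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)
    ) / n


def train_step(tokens, targets, W, b, lr):
    """One full-batch gradient-descent step; only W and b are modified."""
    d = len(b)
    n = len(tokens) * d
    pred = project(tokens, W, b)  # forward pass with the current parameters
    for t, e in enumerate(tokens):
        for i in range(d):
            g = 2.0 * (pred[t][i] - targets[t][i]) / n  # dLoss / dpred[t][i]
            b[i] -= lr * g
            for j in range(d):
                W[i][j] -= lr * g * e[j]


def run_demo(steps=200, lr=0.1, d=4, n_tokens=3, seed=0):
    rng = random.Random(seed)
    # Stand-ins for frozen text-encoder outputs (random toy vectors, not real CLIP).
    tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_tokens)]
    # Hypothetical "better" embeddings the projection should move toward.
    targets = [[x + 0.1 for x in e] for e in tokens]
    # Zero init makes the projection start as the identity map.
    W = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    initial = mse(project(tokens, W, b), targets)
    for _ in range(steps):
        train_step(tokens, targets, W, b, lr)
    final = mse(project(tokens, W, b), targets)
    return initial, final


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"stand-in loss before: {initial:.6f}, after: {final:.6f}")
```

Starting from the identity means the untrained projection leaves the original embeddings unchanged, which is one plausible way to keep the base model's behavior intact at the start of fine-tuning. In a real pipeline the projected embeddings would be passed to the diffusion model as its text conditioning, and the loss would come from that model rather than from fixed targets.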

Abstract

Text-to-image diffusion-based generative models have the stunning ability to generate photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate compositional attribute binding failures, where the model fails to correctly associate descriptive attributes (such as color, shape, or texture) with the corresponding objects in the generated images, and highlight that imperfect text conditioning with the CLIP text encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly…

Code & Models

Repositories

ArmanZarei/Mitigating-T2I-Comp-Issues
PyTorch · Official

Taxonomy

Topics: Semantic Web and Ontologies · Topic Modeling · Image Retrieval and Classification Techniques

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training