Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control

Maria Mihaela Trusca; Wolf Nuyts; Jonathan Thomm; Robert Honig; Thomas Hofmann; Tinne Tuytelaars; Marie-Francine Moens

arXiv:2404.13766·cs.CV·April 23, 2024

TL;DR

This paper introduces a new evaluation model and control method for text-to-image generation, improving attribute-object binding accuracy by leveraging syntactic constraints and disentangled embeddings.

Contribution

The paper proposes EPViT (Edge Prediction Vision Transformer), an image-graph alignment model for evaluating image-text alignment, and focused cross-attention (FCA), which improves attribute-object binding without retraining the diffusion model.

Findings

01

EPViT effectively evaluates image-text alignment.

02

FCA improves attribute-object binding in T2I generation.

03

FCA and DisCLIP embeddings yield substantial improvements in T2I generation, especially in attribute-object binding, on several datasets.

Abstract

Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated in state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several…
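The abstract does not say how FCA is computed, so the following is only an illustrative sketch of the general idea of constraining attention maps with syntactic bindings, not the authors' implementation. It assumes a toy attention table `attn[token][position]` and a hypothetical `attribute_to_object` mapping (for example, derived from a dependency parse of the prompt). The function name `focus_attention` and the `threshold` parameter are likewise assumptions made for illustration.

```python
from typing import Dict, List


def focus_attention(
    attn: List[List[float]],
    attribute_to_object: Dict[int, int],
    threshold: float = 0.5,
) -> List[List[float]]:
    """Restrict each attribute token's attention to its object's region.

    attn[t][p] is the attention weight of prompt token t at image position p.
    attribute_to_object maps an attribute token index to the index of the
    object token it modifies (hypothetical output of a syntactic parse).
    """
    focused = [row[:] for row in attn]  # copy so the input stays unchanged
    for attr_idx, obj_idx in attribute_to_object.items():
        obj_row = attn[obj_idx]
        peak = max(obj_row)
        # Keep positions where the object token attends strongly,
        # measured relative to its own peak (assumed heuristic).
        mask = [1.0 if peak > 0 and w >= threshold * peak else 0.0 for w in obj_row]
        focused[attr_idx] = [w * m for w, m in zip(attn[attr_idx], mask)]
    return focused


# Toy prompt "red car blue bench" over four image positions.
tokens = ["red", "car", "blue", "bench"]
attn = [
    [0.4, 0.3, 0.2, 0.1],  # "red" leaks onto the bench region
    [0.9, 0.8, 0.1, 0.0],  # "car"
    [0.3, 0.2, 0.4, 0.5],  # "blue" leaks onto the car region
    [0.0, 0.1, 0.7, 0.9],  # "bench"
]
bindings = {0: 1, 2: 3}  # red -> car, blue -> bench
focused = focus_attention(attn, bindings)
```

In this toy case, the attention of "red" is zeroed outside the positions where "car" is active, and likewise for "blue" and "bench". The sketch does not renormalize the masked rows; a real cross-attention layer would need to decide whether and how to do so.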

Taxonomy

Topics: Video Analysis and Summarization · Scientific Computing and Data Management · Semantic Web and Ontologies

Methods: Diffusion · Contrastive Language-Image Pre-training