Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control
Maria Mihaela Trusca, Wolf Nuyts, Jonathan Thomm, Robert Honig, Thomas Hofmann, Tinne Tuytelaars, Marie-Francine Moens

TL;DR
This paper introduces a new evaluation model and control method for text-to-image generation, improving attribute-object binding accuracy by leveraging syntactic constraints and disentangled embeddings.
Contribution
It proposes EPViT for evaluating image-text alignment and focused cross-attention (FCA) for better attribute-object binding without retraining diffusion models.
Findings
EPViT effectively evaluates image-text alignment.
FCA improves attribute-object binding in T2I generation.
FCA and DisCLIP yield substantial improvements in T2I generation, especially attribute-object binding, on several datasets.
Abstract
Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated in state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several…
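The idea of constraining attention maps with syntactic bindings can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `focus_attribute_attention`, the `threshold` parameter, and the masking-plus-rescaling rule are all assumptions made for illustration. The sketch assumes the attribute-object pairs (for example "red" modifying "car") have already been extracted from a dependency parse of the prompt, and it restricts each attribute token's cross-attention to the region where its object token attends strongly.

```python
import numpy as np


def focus_attribute_attention(attn, bindings, threshold=0.5):
    """Restrict attribute attention to the region of the bound object.

    attn:      (num_pixels, num_tokens) array of non-negative
               cross-attention weights.
    bindings:  dict mapping attribute token index -> object token index,
               assumed to come from a dependency parse of the prompt.
    threshold: hypothetical cutoff on the max-normalized object map.

    Illustrative assumption: attribute attention is zeroed outside the
    object's region and rescaled so its total mass is unchanged.
    """
    focused = attn.astype(float).copy()
    for attr_idx, obj_idx in bindings.items():
        obj_map = attn[:, obj_idx]
        peak = obj_map.max()
        if peak <= 0:
            continue  # object receives no attention; leave attribute as-is
        # Pixels where the object token attends strongly.
        mask = (obj_map / peak) >= threshold
        original_mass = focused[:, attr_idx].sum()
        masked = focused[:, attr_idx] * mask
        new_mass = masked.sum()
        if new_mass > 0:
            masked *= original_mass / new_mass
        focused[:, attr_idx] = masked
    return focused


# Toy example: 4 pixels, tokens = ["red", "car", "blue", "bench"].
attn = np.array([
    [0.4, 0.9, 0.3, 0.1],
    [0.3, 0.8, 0.3, 0.0],
    [0.2, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.2, 0.7],
])
bindings = {0: 1, 2: 3}  # "red" -> "car", "blue" -> "bench"
focused = focus_attribute_attention(attn, bindings)
print(focused.round(3))
```

In this toy case, "red" ends up attending only to the two pixels dominated by "car" and "blue" only to the two pixels dominated by "bench", while the object maps themselves are left unchanged. A hard mask is only one possible design choice; a soft reweighting by the object map would be a natural alternative.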
Taxonomy
Topics: Video Analysis and Summarization · Scientific Computing and Data Management · Semantic Web and Ontologies
Methods: Diffusion · Contrastive Language-Image Pre-training
