CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback

Yixin Wan; Kai-Wei Chang

arXiv:2505.11178 · cs.CV · May 19, 2025

Open Access

TL;DR

CompAlign introduces a challenging benchmark and a fine-grained evaluation framework to assess and improve the ability of text-to-image models to generate complex, compositional scenes with multiple objects and spatial relationships.

Contribution

The paper presents CompAlign, a new benchmark for complex compositional image generation, and CompQuest, an evaluation method that guides model improvements through detailed feedback.

Findings

1. Models struggle with complex 3D-spatial compositional tasks.
2. Open-source models lag behind commercial models in compositional accuracy.
3. Post-alignment, diffusion models show significant improvements in complex scene generation.

Abstract

State-of-the-art text-to-image (T2I) models are capable of generating high-resolution images given textual prompts. However, they still struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships, for evaluating and improving models on compositional image generation. CompAlign consists of 900 complex multi-subject image generation prompts that combine numerical and 3D-spatial relationships with varied attribute bindings. Our benchmark is remarkably challenging, incorporating generation tasks with 3+ generation subjects with complex 3D-spatial relationships. Additionally, we propose CompQuest, an interpretable and accurate evaluation framework that decomposes complex prompts into atomic sub-questions, then utilizes an MLLM to…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship

Methods: Diffusion