CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback
Yixin Wan, Kai-Wei Chang

TL;DR
CompAlign introduces a challenging benchmark and a fine-grained evaluation framework to assess and improve the ability of text-to-image models to generate complex, compositional scenes with multiple objects and spatial relationships.
Contribution
The paper presents CompAlign, a new benchmark for complex compositional image generation, and CompQuest, an evaluation method that guides model improvements through detailed feedback.
Findings
Models struggle with complex 3D-spatial compositional tasks.
Open-source models lag behind commercial models in compositional accuracy.
After alignment with CompQuest's detailed feedback, diffusion models show significant improvements in complex scene generation.
Abstract
State-of-the-art text-to-image (T2I) models are capable of generating high-resolution images given textual prompts. However, they still struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships, for evaluating and improving models on compositional image generation. CompAlign consists of 900 complex multi-subject image generation prompts that combine numerical and 3D-spatial relationships with varied attribute bindings. Our benchmark is remarkably challenging, incorporating generation tasks with three or more subjects in complex 3D-spatial relationships. Additionally, we propose CompQuest, an interpretable and accurate evaluation framework that decomposes complex prompts into atomic sub-questions, then utilizes an MLLM to…
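The decomposition idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is cut off before it says what the MLLM does with the sub-questions, so the sketch assumes each sub-question gets a yes/no answer and that the score is the fraction answered "yes". The names `Subject`, `SceneSpec`, `decompose`, and `score`, and the question templates, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Subject:
    name: str            # e.g. "dog"
    count: int = 1       # numerical requirement from the prompt
    attribute: str = ""  # attribute binding, e.g. "brown"


@dataclass
class SceneSpec:
    subjects: List[Subject]
    # (subject, relation, object) triples, e.g. ("dog", "in front of", "car")
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


def decompose(spec: SceneSpec) -> List[str]:
    """Turn a structured scene description into atomic yes/no questions.

    The templates are illustrative assumptions, and pluralization is naive
    (it simply appends "s").
    """
    questions: List[str] = []
    for s in spec.subjects:
        questions.append(f"Is there at least one {s.name} in the image?")
        if s.count > 1:
            questions.append(f"Are there exactly {s.count} {s.name}s in the image?")
        if s.attribute:
            questions.append(f"Is the {s.name} {s.attribute}?")
    for a, rel, b in spec.relations:
        questions.append(f"Is the {a} {rel} the {b}?")
    return questions


def score(questions: List[str], answer: Callable[[str], bool]) -> float:
    """Fraction of sub-questions answered 'yes'.

    `answer` stands in for a multimodal model call on one image; this
    aggregation rule is an assumption, not taken from the paper.
    """
    if not questions:
        return 0.0
    return sum(1 for q in questions if answer(q)) / len(questions)


if __name__ == "__main__":
    spec = SceneSpec(
        subjects=[Subject("dog", 2, "brown"), Subject("car", 1, "red")],
        relations=[("dog", "in front of", "car")],
    )
    qs = decompose(spec)
    # Stub answerer: pretend only the spatial relation is depicted wrongly.
    print(qs)
    print(score(qs, lambda q: "in front of" not in q))
```

Keeping the answerer as a plain callable means the stub can be swapped for a real model call without touching the decomposition, and the per-question answers remain available as the kind of fine-grained feedback the summary describes.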
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
Methods: Diffusion
