AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models
Arman Zarei, Jiacheng Pan, Matthew Gwilliam, Soheil Feizi, Zhenheng Yang

TL;DR
AgentComp is a training framework that improves how text-to-image models handle complex compositional prompts. It explicitly trains models to differentiate compositionally similar prompts and images, which improves compositional accuracy and generalization.
Contribution
The paper uses tool-equipped large language models to construct compositional datasets and applies agentic preference optimization to fine-tune text-to-image models on them, achieving state-of-the-art results on T2I-CompBench.
Findings
State-of-the-art on T2I-CompBench
Improved differentiation of similar compositions
Maintains image quality while enhancing reasoning
Abstract
Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality: accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune…
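The abstract does not spell out the optimization objective, so the following is only a generic illustration of what "preference optimization" over compositionally similar pairs usually means, not the AgentComp method itself. It assumes a Bradley-Terry style pairwise loss, a hypothetical scalar score for each image, and a hypothetical `beta` scaling parameter; none of these names come from the paper.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected, beta=1.0):
    """Generic pairwise preference loss: -log(sigmoid(beta * (s_w - s_l))).

    score_preferred: hypothetical model score for the image that matches the prompt.
    score_rejected:  hypothetical score for a compositionally similar but wrong image.
    beta:            assumed scaling factor for the score margin.
    """
    margin = beta * (score_preferred - score_rejected)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Example pair: prompt "a red cube on a blue sphere" vs. an image with swapped colors.
loss_good = pairwise_preference_loss(score_preferred=2.0, score_rejected=0.0)
loss_bad = pairwise_preference_loss(score_preferred=0.0, score_rejected=2.0)
```

The loss is small when the correct image outscores the near-miss and grows when the ordering is reversed, which is the sense in which such an objective pushes a model to separate compositionally similar outputs.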
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Historical Architecture and Urbanism
