AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models

Arman Zarei; Jiacheng Pan; Matthew Gwilliam; Soheil Feizi; Zhenheng Yang

arXiv:2512.09081·cs.CV·December 11, 2025

Open Access

TL;DR

AgentComp is a training framework that teaches text-to-image models to distinguish compositionally similar prompts and images, so that generated outputs follow object relationships, attribute bindings, and fine-grained prompt details more accurately and generalize better.

Contribution

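The paper uses large language models equipped with image generation, editing, and VQA tools to construct compositional datasets autonomously, then fine-tunes text-to-image models on those datasets with an agentic preference optimization method, reporting state-of-the-art results on T2I-CompBench.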

Findings

01. State-of-the-art on T2I-CompBench

02. Improved differentiation of similar compositions

03. Maintains image quality while enhancing reasoning

Abstract

Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality: accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune…
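The abstract is cut off before it describes the preference optimization objective, so the sketch below is not AgentComp's loss. It only illustrates the general idea of learning from a preferred and a rejected sample for the same prompt, using the standard pairwise logistic (Bradley-Terry) loss. The function name, the `beta` scale, and the example scores are all hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float,
                             score_rejected: float,
                             beta: float = 1.0) -> float:
    """Generic pairwise logistic loss: -log(sigmoid(beta * (s_pref - s_rej))).

    Illustrative only; the scores stand in for whatever a model assigns to
    the compositionally correct image and to a near-miss image.
    """
    margin = beta * (score_preferred - score_rejected)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for one prompt and two compositionally similar images.
correct_order = pairwise_preference_loss(score_preferred=2.0, score_rejected=0.5)
wrong_order = pairwise_preference_loss(score_preferred=0.5, score_rejected=2.0)
print(correct_order < wrong_order)  # the loss is lower when the preferred image scores higher
```

The loss shrinks as the preferred sample's score pulls ahead of the rejected one, which is the behavior a model needs in order to separate near-identical compositions.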

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Historical Architecture and Urbanism