Iterative Refinement Improves Compositional Image Generation

Shantanu Jaiswal; Mihir Prabhudesai; Nikash Bhardwaj; Zheyang Qin; Amir Zadeh; Chuan Li; Katerina Fragkiadaki; Deepak Pathak

arXiv:2601.15286 · cs.CV · January 22, 2026

TL;DR

This paper introduces an iterative refinement method for text-to-image models: a vision-language critic gives feedback on each generation, and the model revises the image over multiple steps. The result is more accurate and faithful images for complex prompts, with consistent quantitative and qualitative gains.

Contribution

The paper proposes a simple iterative refinement strategy, guided by a vision-language model acting as critic, that improves compositional image generation without external tools or priors and can be applied to a wide range of image generators and vision-language models.

Findings

01. 16.9% improvement on ConceptMix benchmark
02. 13.8% improvement on T2I-CompBench
03. 58.7% human preference for iterative refinement

Abstract

Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate…
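The critic-in-the-loop idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Critique`, `iterative_refine`, `generate`, `refine`, `critique`, and `max_steps` are all hypothetical, and the toy stand-ins below replace a real T2I model and vision-language model with set operations so that the control flow can be run and inspected.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass
class Critique:
    """Hypothetical critic output: a verdict plus textual feedback."""
    satisfied: bool  # True when the critic finds no remaining violations
    feedback: str    # description of what still needs fixing


def iterative_refine(
    prompt: str,
    generate: Callable[[str], object],
    refine: Callable[[object, str, str], object],
    critique: Callable[[str, object], Critique],
    max_steps: int = 4,
) -> Tuple[object, List[Critique]]:
    """Generate once, then alternate critique and refinement.

    Stops early when the critic is satisfied, or after max_steps critiques.
    """
    image = generate(prompt)
    history: List[Critique] = []
    for _ in range(max_steps):
        verdict = critique(prompt, image)
        history.append(verdict)
        if verdict.satisfied:
            break
        image = refine(image, prompt, verdict.feedback)
    return image, history


# --- Toy stand-ins (assumptions for illustration only) ---------------------
# An "image" is modeled as the set of prompt constraints it satisfies.

def toy_constraints(prompt: str) -> List[str]:
    return [c.strip() for c in prompt.split(",") if c.strip()]


def toy_generate(prompt: str) -> FrozenSet[str]:
    # The initial generation satisfies only the first constraint.
    return frozenset(toy_constraints(prompt)[:1])


def toy_critique(prompt: str, image: FrozenSet[str]) -> Critique:
    missing = [c for c in toy_constraints(prompt) if c not in image]
    return Critique(satisfied=not missing, feedback="; ".join(missing))


def toy_refine(image: FrozenSet[str], prompt: str, feedback: str) -> FrozenSet[str]:
    # Each refinement step fixes one issue named in the feedback.
    first_missing = feedback.split(";")[0].strip()
    return image | {first_missing}


if __name__ == "__main__":
    final, log = iterative_refine(
        "red cube, blue sphere, cube left of sphere",
        toy_generate, toy_refine, toy_critique,
    )
    print(sorted(final), len(log))
```

In a real setting, `generate` and `refine` would wrap an image generator or editor and `critique` would wrap a vision-language model. The sequential structure is the point of the sketch: each step corrects the previous output, whereas parallel sampling draws independent candidates and picks one.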

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)