Compositional Image Synthesis with Inference-Time Scaling

Minsuk Ji, Sanghyeok Lee, and Namhyuk Ahn

arXiv:2510.24133·cs.CV·March 30, 2026

TL;DR

This paper introduces a training-free, inference-time framework that enhances compositional accuracy in text-to-image synthesis by integrating explicit layouts and self-refinement, improving scene alignment with prompts.

Contribution

The framework combines large language models, which synthesize explicit layouts from the prompt, with a vision-language model judge, which reranks candidate images, to generate and refine images without additional training.

Findings

1. Stronger scene alignment with prompts than recent text-to-image models.

2. LLMs are effective at synthesizing explicit layouts from input prompts.

3. Iterative self-refinement improves compositional accuracy.

Abstract

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively reranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refinement-based inference-time scaling, our framework achieves stronger scene alignment with prompts than recent text-to-image models. The code is available at…
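The loop the abstract describes (propose a layout, generate several layout-conditioned candidates, let a judge pick the best, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `refine_with_reranking`, `propose_layout`, `generate`, and `judge` are hypothetical, the abstract does not say how layouts are injected or how one round conditions on the previous one, and the toy stand-ins below replace the real LLM, image generator, and VLM judge with trivial functions so the control flow can be run on its own.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

# Normalized bounding box (x0, y0, x1, y1); an assumed layout format.
Box = Tuple[float, float, float, float]


@dataclass
class LayoutObject:
    label: str
    box: Box


Layout = List[LayoutObject]
ProposeFn = Callable[[str], Layout]                          # stands in for the LLM
GenerateFn = Callable[[str, Layout, Optional[Any], int], Any]  # stands in for the generator
JudgeFn = Callable[[str, Layout, Any], float]                # stands in for the VLM judge


def refine_with_reranking(
    prompt: str,
    propose_layout: ProposeFn,
    generate: GenerateFn,
    judge: JudgeFn,
    num_candidates: int = 4,
    num_rounds: int = 3,
    seed: int = 0,
) -> Tuple[Any, float]:
    """Best-of-N reranking repeated over several refinement rounds."""
    rng = random.Random(seed)
    layout = propose_layout(prompt)
    best_image: Optional[Any] = None
    best_score = float("-inf")

    for _ in range(num_rounds):
        # Assumption: each round may condition on the best image so far.
        candidates = [
            generate(prompt, layout, best_image, rng.randrange(2**31))
            for _ in range(num_candidates)
        ]
        scored = [(judge(prompt, layout, img), img) for img in candidates]
        round_score, round_image = max(scored, key=lambda pair: pair[0])
        # Keep the previous best if this round did not improve on it.
        if round_score > best_score:
            best_score, best_image = round_score, round_image

    return best_image, best_score


# Toy stand-ins (hypothetical): an "image" is just a float quality value.
def toy_propose_layout(prompt: str) -> Layout:
    return [
        LayoutObject("cat", (0.10, 0.40, 0.45, 0.90)),
        LayoutObject("dog", (0.55, 0.40, 0.90, 0.90)),
    ]


def toy_generate(prompt: str, layout: Layout, previous: Optional[Any], seed: int) -> float:
    base = 0.0 if previous is None else previous
    return base + random.Random(seed).uniform(-0.2, 0.3)


def toy_judge(prompt: str, layout: Layout, image: Any) -> float:
    return float(image)


if __name__ == "__main__":
    image, score = refine_with_reranking(
        "a cat to the left of a dog", toy_propose_layout, toy_generate, toy_judge
    )
    print(f"best toy score after refinement: {score:.3f}")
```

In this sketch, inference-time compute scales with `num_candidates * num_rounds` generator calls plus the same number of judge calls. Because the loop keeps the best-scoring result across rounds, the returned score never decreases as `num_rounds` grows for a fixed seed.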

Code & Models

Repositories

gcl-inha/ReFocus (GitHub)