A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models

Jan-Hendrik Koch; Jonas Krumme; Konrad Gadzicki

arXiv:2511.06888·cs.CV·November 12, 2025

TL;DR

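This paper presents a two-stage system that pairs a large language model for layout planning with a layout-conditioned diffusion model for image synthesis, improving control over object counts and spatial arrangement in text-to-image generation. Task decomposition in the planning stage raises object recall from 57.2% to 99.9% for complex scenes.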

Contribution

The paper introduces a two-stage approach that separates layout planning from image synthesis, giving finer control over object counts and spatial arrangements in generated images.

Findings

1. Object recall improved from 57.2% to 99.9% with task decomposition.
2. ControlNet preserves style but hallucinates objects; GLIGEN offers better layout fidelity.
3. End-to-end system generates images with specified objects and arrangements.

Abstract

Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet…
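The first-stage idea described in the abstract (an LLM places only the core objects, then rules insert the rest) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Box` type, the `RULES` table with its offsets, the helper names, and the definition of object recall as the fraction of requested object instances present in the layout. The core layout is hard-coded here where an LLM output would normally go.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized image coordinates (hypothetical format)."""
    label: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


# Hypothetical placement rules: dependent label -> (anchor label, dx, dy, w, h).
# dx and dy are offsets from the anchor box's top-left corner.
RULES = {
    "fork": ("plate", -0.08, 0.0, 0.04, 0.20),
    "knife": ("plate", 0.22, 0.0, 0.04, 0.20),
}


def complete_layout(core, requested):
    """Return the core layout plus rule-placed boxes for missing dependent objects."""
    layout = list(core)
    want = Counter(requested)
    for label, (anchor_label, dx, dy, w, h) in RULES.items():
        anchors = [b for b in core if b.label == anchor_label]
        have = sum(b.label == label for b in core)
        missing = max(want[label] - have, 0)
        # Simplification: attach one missing object to each anchor, in order.
        for anchor in anchors[:missing]:
            layout.append(Box(label, anchor.x + dx, anchor.y + dy, w, h))
    return layout


def object_recall(layout, requested):
    """Fraction of requested object instances that appear in the layout."""
    want = Counter(requested)
    have = Counter(b.label for b in layout)
    matched = sum(min(n, have[label]) for label, n in want.items())
    return matched / sum(want.values()) if want else 1.0


# Toy example: the "LLM" stage returned only the core object.
requested = ["plate", "fork", "knife"]
core = [Box("plate", 0.40, 0.40, 0.20, 0.20)]
full = complete_layout(core, requested)

print(object_recall(core, requested))
print(object_recall(full, requested))
```

In this toy case the core layout alone covers one of the three requested objects, and the rule-based step supplies the other two. The numbers are properties of the toy input only and say nothing about the results reported in the paper.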

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning