CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step

Zheyuan Liu; Munan Ning; Qihui Zhang; Shuo Yang; Zhongrui Wang; Yiwei Yang; Xianzhe Xu; Yibing Song; Weihua Chen; Fan Wang; Li Yuan

arXiv:2507.04451·cs.CV·July 8, 2025

TL;DR

CoT-Diff introduces a step-by-step reasoning framework that integrates 3D layout planning with diffusion models, significantly enhancing spatial accuracy and compositional fidelity in text-to-image generation for complex scenes.

Contribution

This work presents a novel framework that tightly couples MLLM-driven 3D layout planning with diffusion models, enabling dynamic, layout-aware reasoning during image synthesis.

Findings

01. Improves spatial alignment and compositional fidelity in T2I generation.

02. Outperforms the state-of-the-art method by 34.7% in complex-scene spatial accuracy.

03. Enables precise spatial control by updating the layout during the denoising process.

Abstract

Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a…
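
The per-step loop described in the abstract can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: `Box3D`, `render_depth_map`, `sample_with_layout_reasoning`, `toy_denoise_step`, and `toy_planner` are all hypothetical names, the "MLLM" and "denoiser" are trivial stand-ins, and the fusion mechanism (truncated in the abstract above) is not modeled. It only shows the control flow: render the current 3D layout to a depth condition, take one denoising step, then let a planner revise the layout from the intermediate prediction.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Box3D:
    """One object in a hypothetical 3D layout; coordinates are normalized to [0, 1]."""
    label: str
    center: Tuple[float, float, float]  # (x, y, depth); smaller depth = nearer
    size: Tuple[float, float, float]    # (width, height, thickness)


Layout = List[Box3D]
DepthMap = List[List[float]]


def render_depth_map(layout: Layout, res: int = 8) -> DepthMap:
    """Rasterize boxes into a coarse depth map; nearer boxes overwrite farther ones."""
    depth = [[1.0] * res for _ in range(res)]  # 1.0 = far background
    for box in sorted(layout, key=lambda b: b.center[2], reverse=True):  # far to near
        x, y, z = box.center
        w, h, _ = box.size
        c0, c1 = int(max(0.0, x - w / 2) * res), int(min(1.0, x + w / 2) * res)
        r0, r1 = int(max(0.0, y - h / 2) * res), int(min(1.0, y + h / 2) * res)
        for r in range(r0, r1):
            for c in range(c0, c1):
                depth[r][c] = z
    return depth


def sample_with_layout_reasoning(
    prompt: str,
    layout: Layout,
    denoise_step: Callable,  # (latent, t, layout, depth) -> (latent, x0_pred)
    planner: Callable,       # (prompt, x0_pred, layout) -> updated layout
    num_steps: int = 4,
    seed: int = 0,
):
    """Single sampling round in which the layout is revised after every step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(4)]  # toy 4-value "latent"
    for t in reversed(range(num_steps)):
        depth = render_depth_map(layout)              # layout -> depth condition
        latent, x0_pred = denoise_step(latent, t, layout, depth)
        layout = planner(prompt, x0_pred, layout)     # planner inspects the prediction
    return latent, layout


def toy_denoise_step(latent, t, layout, depth):
    """Stand-in denoiser (assumption): pull the latent halfway toward the mean depth."""
    target = sum(map(sum, depth)) / (len(depth) * len(depth[0]))
    new_latent = [v + 0.5 * (target - v) for v in latent]
    return new_latent, new_latent  # second value plays the role of the x0 estimate


def toy_planner(prompt, x0_pred, layout):
    """Stand-in for the MLLM (assumption): move every box halfway toward depth 0.5."""
    return [
        Box3D(b.label, (b.center[0], b.center[1], 0.5 * (b.center[2] + 0.5)), b.size)
        for b in layout
    ]


layout0 = [
    Box3D("cat", (0.3, 0.5, 0.2), (0.3, 0.3, 0.2)),
    Box3D("sofa", (0.6, 0.6, 0.8), (0.6, 0.4, 0.3)),
]
latent, final_layout = sample_with_layout_reasoning(
    "a cat in front of a sofa", layout0, toy_denoise_step, toy_planner, num_steps=4
)
```

In a real system the two callables would wrap a diffusion model and an MLLM. Keeping them as injected functions makes the point of the abstract visible: planning happens inside the sampling loop rather than once before it.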

Taxonomy

Methods: ALIGN · Diffusion