Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang, Weifeng Chen, Markos Georgopoulos, Felix Juefei-Xu, Yuxiang Bao, Julian McAuley, Manling Li, Zecheng He

TL;DR
This paper introduces a process-driven, multi-step image generation paradigm that mimics human incremental painting by iteratively planning, drafting, reflecting, and refining images grounded in evolving visual states.
Contribution
It proposes a novel multi-stage framework that explicitly models intermediate reasoning steps, improving interpretability and supervision in text-to-image generation.
Findings
The method achieves more coherent and grounded images compared to single-step models.
Step-wise supervision enhances the consistency and quality of intermediate visual states.
Experiments demonstrate improved performance across various text-to-image benchmarks.
Abstract
Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
