Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Lei Zhang; Junjiao Tian; Zhipeng Fan; Kunpeng Li; Jialiang Wang; Weifeng Chen; Markos Georgopoulos; Felix Juefei-Xu; Yuxiang Bao; Julian McAuley; Manling Li; Zecheng He

arXiv:2604.04746 · cs.CV · April 9, 2026

TL;DR

This paper introduces a process-driven, multi-step image generation paradigm that mimics human incremental painting by iteratively planning, drafting, reflecting, and refining images grounded in evolving visual states.

Contribution

The paper proposes a multi-step framework that explicitly models the intermediate reasoning steps of text-to-image generation, which makes the generation process easier to interpret and allows it to be supervised step by step.

Findings

01

The method achieves more coherent and grounded images compared to single-step models.

02

Step-wise supervision enhances the consistency and quality of intermediate visual states.

03

Experiments demonstrate improved performance across various text-to-image benchmarks.

Abstract

Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core…
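The iteration structure described in the abstract can be sketched as a simple control loop. This is only an illustrative sketch, not the authors' implementation: the names `Step`, `Trajectory`, `generate`, and `toy_model` are hypothetical, images are stood in for by plain strings, and the real system would call a unified multimodal model where the toy function is used here. The point it shows is the interleaving: each textual stage reads the latest visual state, and each visual stage is conditioned on the trajectory so far.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    stage: str     # one of "plan", "draft", "reflect", "refine"
    modality: str  # "text" or "image"
    content: str   # string stand-in for text or for an image (assumption)


@dataclass
class Trajectory:
    prompt: str
    steps: List[Step] = field(default_factory=list)

    def latest_image(self) -> str:
        """Return the most recent visual state, or "" for a blank canvas."""
        for step in reversed(self.steps):
            if step.modality == "image":
                return step.content
        return ""


# The four stages named in the abstract, in order, with their modality.
STAGES = [
    ("plan", "text"),
    ("draft", "image"),
    ("reflect", "text"),
    ("refine", "image"),
]

# Hypothetical model interface: given a stage name and the trajectory so far,
# produce the content for that stage.
Model = Callable[[str, Trajectory], str]


def generate(prompt: str, model: Model, num_iterations: int = 2) -> Trajectory:
    """Unfold generation over several iterations of the four stages."""
    trajectory = Trajectory(prompt)
    for _ in range(num_iterations):
        for stage, modality in STAGES:
            content = model(stage, trajectory)
            trajectory.steps.append(Step(stage, modality, content))
    return trajectory


def toy_model(stage: str, trajectory: Trajectory) -> str:
    """Placeholder model: text stages describe the canvas, image stages extend it."""
    canvas = trajectory.latest_image()
    if stage == "plan":
        return f"plan the next change given canvas=[{canvas}]"
    if stage == "draft":
        return canvas + "d"  # append a coarse "draft" mark
    if stage == "reflect":
        return f"inspect canvas=[{canvas}]"
    return canvas + "r"      # append a "refine" mark


if __name__ == "__main__":
    result = generate("a red cube on a table", toy_model, num_iterations=2)
    for step in result.steps:
        print(f"{step.stage:8s} {step.modality:5s} {step.content}")
```

Keeping the whole trajectory, rather than only the final canvas, is what would make step-wise supervision possible in a setup like this, since every intermediate text and image is available for inspection.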
