Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Ziyu Guo; Renrui Zhang; Hongyu Li; Manyuan Zhang; Xinyan Chen; Sifan Wang; Yan Feng; Peng Pei; Pheng-Ann Heng

arXiv:2511.16671·cs.CV·November 21, 2025

TL;DR

This paper introduces Thinking-while-Generating (TwiG), a framework that interleaves textual reasoning with visual generation, so that text and image content interact on the fly and the resulting images are more context-aware and semantically rich.

Contribution

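The paper presents what the authors describe as the first interleaved framework in which textual reasoning both guides upcoming local regions and reflects on previously synthesized ones during visual generation. It investigates three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT), and reinforcement learning.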

Findings

01. Interleaved reasoning improves visual output quality.

02. Multiple strategies demonstrate the framework's versatility.

03. Code will be publicly available for further research.

Abstract

Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. Existing methods incorporate textual reasoning, i.e., thinking, either before the generation process (as pre-planning) or after it (as post-refinement), yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on…
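The guide-then-reflect interplay described in the abstract can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function `interleaved_generate`, the `Trace` container, the three callables (`think`, `generate_region`, `reflect`), and the toy stand-ins are all hypothetical names introduced here, and regions are represented as plain strings rather than image tokens. It also assumes that reflection examines the region just produced and may trigger a single revision, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trace:
    """Record of one interleaved run (hypothetical container)."""
    regions: List[str] = field(default_factory=list)   # synthesized local regions, in order
    thoughts: List[str] = field(default_factory=list)  # interleaved textual reasoning


def interleaved_generate(
    prompt: str,
    num_regions: int,
    think: Callable[[str, List[str], int], str],
    generate_region: Callable[[str, str, List[str], int], str],
    reflect: Callable[[str, List[str], int], Optional[str]],
) -> Trace:
    """Alternate textual reasoning with region-by-region generation."""
    trace = Trace()
    for i in range(num_regions):
        # Guide: reason in text about the upcoming region, given what exists so far.
        thought = think(prompt, trace.regions, i)
        trace.thoughts.append(thought)
        region = generate_region(prompt, thought, trace.regions, i)

        # Reflect: critique the content synthesized so far (assumption: one
        # optional revision of the newest region per step).
        critique = reflect(prompt, trace.regions + [region], i)
        if critique is not None:
            trace.thoughts.append(critique)
            region = generate_region(prompt, critique, trace.regions, i)

        trace.regions.append(region)
    return trace


# Toy stand-ins so the loop can be run without any model (all hypothetical).
def toy_think(prompt: str, regions: List[str], i: int) -> str:
    return f"plan[{i}] for '{prompt}' given {len(regions)} prior regions"


def toy_generate(prompt: str, guidance: str, regions: List[str], i: int) -> str:
    return f"region{i}<{guidance}>"


def toy_reflect(prompt: str, regions: List[str], i: int) -> Optional[str]:
    # Pretend that only the second region needs a revision.
    return f"revise[{i}]" if i == 1 else None


if __name__ == "__main__":
    result = interleaved_generate(
        "a red cube on a table", 3, toy_think, toy_generate, toy_reflect
    )
    for line in result.thoughts:
        print(line)
```

In a real system the three callables would presumably be backed by a single multimodal model, via prompting or via the trained variants the abstract mentions; the loop structure is the only part this sketch is meant to convey.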

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Games