Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Siqi Kou; Jiachun Jin; Zetong Zhou; Ye Ma; Yugang Wang; Quan Chen; Peng Jiang; Xiao Yang; Jun Zhu; Kai Yu; Zhijie Deng

arXiv:2601.10332·cs.CV·January 16, 2026

Open Access

TL;DR

This paper introduces a reasoning-aware text-to-image diffusion framework in which the LLM text encoder reasons about and rewrites the user prompt before generation, with the aim of improving the semantic accuracy and visual realism of the generated images.

Contribution

The paper proposes the think-then-generate (T2G) paradigm, in which the LLM encoder reasons about the prompt before image synthesis, and co-optimizes the LLM encoder with the diffusion backbone to improve semantic fidelity.

Findings

01. Improved factual consistency and semantic alignment in generated images.

02. Achieved a WISE score of 0.79, close to GPT-4 performance.

03. Enhanced image editing capabilities with reasoning-aware prompts.

Abstract

Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to…
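The control flow described in the abstract (reason about and rewrite the prompt, then condition the diffusion model on the states of the rewritten prompt) can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: `T2GPipeline`, `toy_rewrite`, `toy_encode`, and `toy_denoise` are hypothetical names, the "hidden state" is a two-number stand-in, and the supervised fine-tuning and co-optimization stages are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class T2GPipeline:
    """Toy think-then-generate flow. All three callables are hypothetical stand-ins."""
    think_and_rewrite: Callable[[str], str]      # LLM reasons, returns a rewritten prompt
    encode: Callable[[str], Vector]              # states of the rewritten prompt
    denoise: Callable[[Vector, Vector], Vector]  # one conditioned denoising step

    def generate(self, raw_prompt: str, latent: Vector, steps: int = 4) -> Vector:
        # 1. Think and rewrite: the raw prompt is not encoded directly.
        rewritten = self.think_and_rewrite(raw_prompt)
        # 2. The states of the *rewritten* prompt become the conditioning signal.
        cond = self.encode(rewritten)
        # 3. Iterative denoising, conditioned on those states.
        for _ in range(steps):
            latent = self.denoise(latent, cond)
        return latent


def toy_rewrite(prompt: str) -> str:
    # Assumption: a real system would run LLM reasoning here.
    return f"{prompt} (rewritten: describe the concrete scene to depict)"


def toy_encode(text: str) -> Vector:
    # Assumption: fake 2-d "hidden state" built from character statistics.
    codes = [ord(c) for c in text]
    return [sum(codes) / len(codes) / 128.0, len(text) / 100.0]


def toy_denoise(latent: Vector, cond: Vector) -> Vector:
    # Assumption: move the latent halfway toward the conditioning vector.
    return [0.5 * x + 0.5 * c for x, c in zip(latent, cond)]


pipeline = T2GPipeline(toy_rewrite, toy_encode, toy_denoise)
image_latent = pipeline.generate("a holiday", [0.0, 0.0], steps=30)
```

The point of the sketch is the ordering: a text-pixel mapper would call `encode` on the raw prompt, whereas here the conditioning vector comes from the rewritten prompt, so whatever the rewriting step infers is what the denoiser sees.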

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship