EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai; Yujie Zhou; Long Xing; Jiazi Bu; Xilin Wei; Yuhong Liu; Beichen Zhang; Kai Chen; Yuhang Zang

arXiv:2603.12252 · cs.CV · March 26, 2026

TL;DR

EndoCoT introduces an iterative reasoning framework that enhances multimodal language models' guidance in diffusion tasks, significantly improving accuracy on complex benchmarks by activating deeper reasoning and grounding it in textual supervision.

Contribution

The paper proposes EndoCoT, a novel framework that activates MLLMs' reasoning through iterative thought guidance and grounding, enabling step-by-step complex task solving in diffusion models.

Findings

01. Achieves 92.1% accuracy on diverse benchmarks.

02. Outperforms baseline by 8.3 percentage points.

03. Effectively activates reasoning in diffusion tasks.

Abstract

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth: single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process, which prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance…
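
The contrast the abstract draws between invariant guidance and iteratively refined guidance can be illustrated with a toy control-flow sketch. This is not the authors' implementation: `refine_thought`, `denoise_step`, `generate`, and `distance` are hypothetical stand-ins invented here, and plain lists of floats replace the MLLM latent states and the DiT sample. It only shows the difference between encoding once and reusing the result versus refining the guidance between denoising steps.

```python
from typing import List

Vector = List[float]


def refine_thought(thought: Vector, instruction: Vector, rate: float = 0.5) -> Vector:
    # Hypothetical stand-in for one reasoning pass: nudge the latent
    # thought state part of the way toward the instruction embedding.
    return [t + rate * (i - t) for t, i in zip(thought, instruction)]


def denoise_step(sample: Vector, guidance: Vector, step_size: float = 0.3) -> Vector:
    # Hypothetical stand-in for one denoising step: pull the sample
    # part of the way toward the current guidance vector.
    return [s + step_size * (g - s) for s, g in zip(sample, guidance)]


def generate(instruction: Vector, noise: Vector, num_steps: int, iterative: bool) -> Vector:
    # Single-step encoding: one refinement pass starting from a zero state.
    thought = refine_thought([0.0] * len(instruction), instruction)
    sample = list(noise)
    for _ in range(num_steps):
        sample = denoise_step(sample, thought)
        if iterative:
            # Iterative variant: refine the guidance between denoising steps.
            thought = refine_thought(thought, instruction)
    return sample


def distance(a: Vector, b: Vector) -> float:
    # Euclidean distance between two vectors of equal length.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


if __name__ == "__main__":
    instruction = [1.0, -2.0]  # toy "intended" target, an assumption of this sketch
    noise = [0.0, 0.0]
    fixed = generate(instruction, noise, num_steps=8, iterative=False)
    refined = generate(instruction, noise, num_steps=8, iterative=True)
    print("invariant guidance distance:", distance(fixed, instruction))
    print("iterative guidance distance:", distance(refined, instruction))
```

In this toy setting the invariant variant can only converge to the single-pass encoding, while the iterative variant keeps moving the guidance toward the instruction, so its final sample ends up closer to the target. How EndoCoT actually refines thought states and grounds them in textual supervision is described in the paper itself.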

Code & Models

Models

internlm/EndoCoT
Hugging Face model · 28 downloads · 10 likes

Datasets

internlm/EndoCoT-Data
dataset · 1.9k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling