From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs

Yuchuan Tian; Yuchen Liang; Shuo Zhang; Yingte Shu; Guangwen Yang; Wei He; Sibo Fang; Tianyu Guo; Kai Han; Chao Xu; Hanting Chen; Xinghao Chen; Yunhe Wang

arXiv:2512.06776 · cs.CL · February 2, 2026

Open Access

TL;DR

This paper introduces a principled method for adapting Auto-Regressive (AR) models into Diffusion Language Models (DLMs) by gradually growing the block size within a Block-Diffusion framework, starting from block size 1. The adapted models show effective long-context generation and reach state-of-the-art results among 7B-class DLMs.

Contribution

It proposes a novel adaptation pathway from AR to DLM using a block-diffusion framework with gradual block size increase, enhancing long-context capabilities.

Findings

01. Achieves state-of-the-art results among 7B-class DLMs.
02. Demonstrates effective long-context modeling and reasoning.
03. Shows that the adaptation method is competitive across various model scales.

Abstract

Diffusion Language Models (DLMs) enable fast generation, yet training large DLMs from scratch is costly. As a practical shortcut, adapting off-the-shelf Auto-Regressive (AR) model weights into a DLM could quickly equip the DLM with strong long-context generation capabilities. Prior "adaptation" attempts either modify logits or randomly grow attention masks to Full-Sequence diffusion, or simply transplant AR weights into a Block-Diffusion recipe, leaving two key questions unaddressed: where is the final destination of adaptation, and how to adapt better? For manifold benefits, we reframe the whole AR-to-DLM adaptation under the Block-Diffusion paradigm, transitioning from block size 1 to the final Block-Diffusion state. Concretely, the principled pathway of adaptation is designed as follows: we keep a context-causal path where causal attention is kept in the prefix, an efficient parallel…
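To illustrate why block size 1 is a natural starting point for the adaptation, here is a minimal sketch of a generic block-wise attention mask. It assumes the common block-causal formulation, in which a position may attend to every position in its own block and in all earlier blocks. The function names `block_causal_mask` and `causal_mask` are hypothetical and not taken from the paper, and the sketch does not reproduce the paper's context-causal path or its block-size schedule.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """Standard AR mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Generic block-wise mask (an assumption, not the paper's exact recipe).

    mask[i][j] is True when query position i may attend to key position j,
    which here means j lies in the same block as i or in an earlier block.
    """
    if seq_len < 0 or block_size < 1:
        raise ValueError("seq_len must be >= 0 and block_size must be >= 1")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # With block size 1 the block-wise mask is exactly the causal AR mask.
    print(block_causal_mask(4, 1) == causal_mask(4))
    # Larger blocks add bidirectional attention inside each block.
    for row in block_causal_mask(4, 2):
        print(["x" if allowed else "." for allowed in row])
```

Under this assumption, a block size of 1 gives exactly the causal mask of an AR model, and larger blocks only add attention within each block, so growing the block size moves the mask away from the AR case in small steps.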

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Domain Adaptation and Few-Shot Learning