Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching

Fengrui Zuo; Zhiwei Ke; Yiming Liu; Wenqi Lou; Chao Wang; Xuehai Zhou

arXiv:2601.20332·cs.LG·February 3, 2026

TL;DR

This paper introduces Window-Diffusion, a windowed token pruning and caching technique that accelerates diffusion language model (DLM) inference by exploiting structural locality in decoding, achieving up to 99x speedup with minimal loss in generation quality.

Contribution

The paper proposes a window-based token pruning and caching method that leverages structural locality in DLM inference, delivering significant speedup on pretrained models without retraining or constrained update orders.

Findings

01

Up to 99x inference speedup on the LLaDA and Dream models.

02

Largely preserves generation quality under matched compute budgets.

03

Exploits local token influence to reduce redundant computation.

Abstract

Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decode transient. Motivated by these observations, we propose Window-Diffusion (source code available at https://github.com/vhicrgit/Window-Diffusion), a…
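
The abstract is cut off before the method is described, so the following is only an illustrative sketch of how the three observations above could translate into a per-step token partition. It is not the authors' implementation: the function `plan_step`, the `StepPlan` container, and the `window` and `transient` parameters are all hypothetical names chosen for this example. The idea is that long-decoded tokens reuse cached representations, freshly decoded tokens are recomputed during their short transient, the first few undecoded positions form the active window, and the remaining undecoded positions are skipped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepPlan:
    cached: List[int]     # decoded long enough ago: reuse stored representations
    recompute: List[int]  # decoded recently: still inside the post-decode transient
    active: List[int]     # undecoded positions inside the window: computed this step
    pruned: List[int]     # undecoded positions beyond the window: skipped this step


def plan_step(decoded_at: List[Optional[int]], step: int,
              window: int = 4, transient: int = 1) -> StepPlan:
    """Partition sequence positions for one denoising step.

    decoded_at[i] is the step at which position i was decoded, or None if
    the position is still masked. `window` and `transient` are assumed
    hyperparameters for this sketch, not values taken from the paper.
    """
    cached, recompute, undecoded = [], [], []
    for pos, t in enumerate(decoded_at):
        if t is None:
            undecoded.append(pos)
        elif step - t <= transient:
            recompute.append(pos)
        else:
            cached.append(pos)
    # Prefix-localized window: the earliest undecoded positions stay active.
    return StepPlan(cached, recompute, undecoded[:window], undecoded[window:])


# Example: positions 0 and 1 decoded at step 0, position 2 decoded at step 3.
plan = plan_step([0, 0, 3, None, None, None, None, None],
                 step=4, window=2, transient=1)
print(plan)
```

In this toy partition only the `recompute` and `active` positions would need fresh computation at a given step, which is where a windowed scheme of this kind would save work relative to full-sequence attention.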

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods