Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

Junyi Wu; Tianchen Zhao; Shaoqiu Zhang; Linfeng Zhang; Guohao Dai; Yu Wang

arXiv:2605.18165·cs.LG·May 19, 2026


TL;DR

This paper introduces a method to reduce computational redundancy in diffusion-based large language models by compressing [MASK] tokens, enabling faster decoding and better long-context handling.

Contribution

The paper proposes position-preserving [MASK] token compression and terminal-aware augmentation to accelerate decoding and improve context scaling in diffusion LLMs.

Findings

01. Redundant [MASK] token processing accounts for a significant share of the computational cost of decoding.

02. Compressed [MASK] tokens retain structural information while reducing computation.

03. Augmentation with terminal [MASK] tokens improves generation quality with minimal overhead.

Abstract

Unlike autoregressive models, which generate one token at a time, dLLMs denoise a chunk of [MASK] tokens jointly and sample one or more tokens per step; despite enabling parallel decoding, this process incurs substantial computational cost due to the large chunk size of masked tokens. We observe that much of this cost is spent on repeatedly processing the preceding context and many [MASK] tokens with the same feature representations, indicating considerable computational redundancy. In this work, we revisit dLLM's redundancy from the perspective of [MASK] tokens. Through systematic analysis, we verify the redundancy of [MASK] tokens while revealing their critical role in providing structural information. Guided by these findings, we propose position-preserving [MASK] token compression and terminal-aware augmentation. By compressing redundant [MASK] computation, this approach accelerates…
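
The abstract names the idea of position-preserving [MASK] compression without giving the selection rule, so the following is only a minimal sketch of what "position preserving" could mean in practice, not the authors' implementation. It drops most [MASK] tokens from a chunk while keeping each surviving token's original position id. The function name `compress_masks`, the `keep_leading` parameter, the placeholder `MASK_ID`, and the rule "keep the first few masks plus the terminal one" are all assumptions made for illustration.

```python
from typing import List, Tuple

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def compress_masks(
    token_ids: List[int], keep_leading: int, mask_id: int = MASK_ID
) -> Tuple[List[int], List[int]]:
    """Drop most [MASK] tokens but keep every survivor's original position id."""
    mask_positions = [i for i, t in enumerate(token_ids) if t == mask_id]

    # Assumed selection rule (not from the paper): keep the first
    # `keep_leading` masks plus the final mask of the sequence.
    kept_masks = set(mask_positions[:keep_leading])
    if mask_positions:
        kept_masks.add(mask_positions[-1])  # terminal [MASK] marks the chunk end

    kept_tokens: List[int] = []
    position_ids: List[int] = []
    for i, t in enumerate(token_ids):
        if t != mask_id or i in kept_masks:
            kept_tokens.append(t)
            position_ids.append(i)  # original index, not the compacted one
    return kept_tokens, position_ids


prompt = [101, 102, 103]            # made-up context token ids
sequence = prompt + [MASK_ID] * 8   # context followed by an 8-token masked chunk
tokens, positions = compress_masks(sequence, keep_leading=2)
print(tokens)
print(positions)
```

In this toy run the 11-token input shrinks to 6 tokens, and the returned position ids are `[0, 1, 2, 3, 4, 10]`: the gap between 4 and 10 is what tells a position-aware model how long the masked chunk originally was, even though the intermediate masks are no longer processed.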
