Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs
Junyi Wu, Tianchen Zhao, Shaoqiu Zhang, Linfeng Zhang, Guohao Dai, Yu Wang

TL;DR
This paper introduces a method to reduce computational redundancy in diffusion-based large language models by compressing [MASK] tokens, enabling faster decoding and better long-context handling.
Contribution
It proposes position-preserving [MASK] token compression and terminal-aware augmentation to accelerate decoding and improve context scaling in diffusion LLMs.
Findings
Redundant [MASK] token processing accounts for a significant share of the computational cost of decoding.
Compressed [MASK] tokens maintain structural information while reducing computation.
Augmentation with terminal [MASK] tokens enhances generation quality with minimal overhead.
Abstract
Unlike autoregressive models, which generate one token at a time, diffusion LLMs (dLLMs) denoise a chunk of [MASK] tokens jointly and sample one or more tokens per step. Although this enables parallel decoding, it incurs substantial computational cost because the chunk of masked tokens is large. We observe that much of this cost is spent on repeatedly processing the preceding context and many [MASK] tokens with the same feature representations, indicating considerable computational redundancy. In this work, we revisit redundancy in dLLMs from the perspective of [MASK] tokens. Through systematic analysis, we verify the redundancy of [MASK] tokens while revealing their critical role in providing structural information. Guided by these findings, we propose position-preserving [MASK] token compression and terminal-aware augmentation. By compressing redundant [MASK] computation, this approach accelerates…
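To make the idea of position-preserving compression concrete, here is a toy sketch in Python. It is not the authors' algorithm: the abstract does not say which [MASK] tokens are kept, so the selection rule here (keep the first few masks plus a few at the end of the chunk) and every name in it (`compress_masks`, `keep_leading`, `keep_terminal`) are assumptions made for illustration. The only point it demonstrates is that the input can get shorter while each surviving token keeps the position id it had in the uncompressed sequence.

```python
from typing import List, Tuple

MASK = "[MASK]"


def compress_masks(
    prompt: List[str],
    num_masks: int,
    keep_leading: int,
    keep_terminal: int,
) -> Tuple[List[str], List[int]]:
    """Build a shortened input plus the ORIGINAL position id of each token.

    Hypothetical selection rule (not taken from the paper): keep the first
    `keep_leading` masks and the last `keep_terminal` masks of the chunk.
    """
    if min(num_masks, keep_leading, keep_terminal) < 0:
        raise ValueError("counts must be non-negative")

    prompt_len = len(prompt)
    # Positions the masks would occupy in the full, uncompressed sequence.
    mask_positions = list(range(prompt_len, prompt_len + num_masks))

    if keep_leading + keep_terminal >= num_masks:
        kept = mask_positions  # nothing to drop
    else:
        kept = (
            mask_positions[:keep_leading]
            + mask_positions[num_masks - keep_terminal:]
        )

    tokens = prompt + [MASK] * len(kept)
    # Prompt tokens keep 0..prompt_len-1; kept masks keep their old ids,
    # so the position ids may contain a gap where masks were dropped.
    position_ids = list(range(prompt_len)) + kept
    return tokens, position_ids


tokens, position_ids = compress_masks(["a", "b", "c"], 8, 2, 1)
print(tokens)        # ['a', 'b', 'c', '[MASK]', '[MASK]', '[MASK]']
print(position_ids)  # [0, 1, 2, 3, 4, 10]
```

In this example the model would see 6 tokens instead of 11, and the jump from position 4 to position 10 is what still tells it how long the full chunk is. In a real model the position ids would be passed to the positional encoding explicitly rather than inferred from token order.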
