DPad: Efficient Diffusion Language Models with Suffix Dropout
Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai "Helen" Li, Yiran Chen

TL;DR
DPad is a training-free method that reduces the computational cost of diffusion-based language models by limiting attention to nearby suffix tokens, achieving significant speedups without sacrificing accuracy.
Contribution
DPad introduces a simple, effective, training-free approach that combines a sliding window with distance-decay dropout to improve the efficiency of diffusion language models.
Findings
Up to 61.4x speedup over vanilla dLLMs
Maintains comparable accuracy with reduced computation
Compatible with existing optimization techniques
Abstract
Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad…
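The two strategies in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`select_suffix_positions`, `build_forward_indices`), the parameters `window`, `sigma`, and `seed`, and their default values are hypothetical. The Gaussian-shaped keep probability is an assumption suggested by the reviewers' mention of a Gaussian sampler, and a fixed seed stands in for the "deterministic" removal described above.

```python
import numpy as np


def select_suffix_positions(block_end, seq_len, window=128, sigma=32.0, seed=0):
    """Return sorted absolute indices of the suffix tokens to keep.

    block_end: index of the first suffix token (one past the current block).
    All defaults are illustrative assumptions, not values from the paper.
    """
    # (i) Sliding window: only the first `window` suffix positions are candidates.
    window_end = min(seq_len, block_end + window)
    candidates = np.arange(block_end, window_end)
    if candidates.size == 0:
        return candidates

    # (ii) Distance-decay dropout: the keep probability shrinks with distance
    # from the current block (Gaussian shape assumed here).
    distance = candidates - block_end
    keep_prob = np.exp(-0.5 * (distance / sigma) ** 2)

    # A fixed seed makes the drop pattern identical on every call.
    rng = np.random.default_rng(seed)
    keep = rng.random(candidates.size) < keep_prob
    return candidates[keep]


def build_forward_indices(block_end, seq_len, **kwargs):
    """Indices fed to the model: prefix and current block, plus the kept suffix."""
    kept_suffix = select_suffix_positions(block_end, seq_len, **kwargs)
    return np.concatenate([np.arange(block_end), kept_suffix])
```

Because the pruning happens before attention is computed, a caller would gather the hidden states (and position ids) at `build_forward_indices(...)` and run the unmodified model on that shorter sequence; how this interacts with prefix caching is not covered by this sketch.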
Peer Reviews
Decision: ICLR 2026 Poster
- Clean analysis of what happens with suffix tokens in block-wise autoregressive generation with MDLMs.
- Well-motivated method for reducing unnecessary computation that can be applied at inference time without changes to the model.
- Strong improvements in efficiency, and good ablations showing that the Gaussian sampler, which reflects the decay in attention scores, actually helps.
See questions.
1. The suffix window + decay dropout is lightweight, does not touch weights, and composes with parallel decoding and caching; the large compounding speedups in long sequences are compelling.
2. Attention/entropy analyses show strong distance decay and near-zero entropy far in the suffix; ablations identify a 64–128 token “critical window.”
3. Empirical studies cover both reasoning and code tasks, and the tables separate latency from TPS and discuss why TPS can look small
1. Recent training-free accelerations for dLLMs cover Block-dLLM, Fast-dLLM (dLLM cache), Sparse-dLLM, etc. The paper cites some of them but does not always compare directly, especially on matched long-sequence regimes and memory usage.
2. The “distance-decay” sparsity concept parallels existing efforts in distance-biased or entropy-guided pruning methods. While DPad’s pre-attention pruning is a nice twist, the paper should sharpen how this differs theoretically/empirically from Sparse-dLLM (e
1. The paper conducts a detailed analysis of suffix tokens in diffusion language models from a sequence-level perspective. It introduces a novel viewpoint for understanding suffix tokens and identifies three key properties.
2. The proposed **DPad** method is training-free; by applying distance-decay-style suffix dropout during inference, it significantly reduces computational complexity while maintaining model accuracy.
1. Section 3.1 formally introduces the *scratchpad mechanism* and explains, from the perspective of attention-based information flow, that *‘the suffix serves as temporary memory to assist the ongoing denoising process’.* While this section highlights the potential ‘*Attention Connection’* role of suffix tokens, it does not evaluate how critical this effect is to model inference. No intervention experiments (e.g., masking parts of the suffix and observing changes in hidden states) are provided t
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
