DPad: Efficient Diffusion Language Models with Suffix Dropout

Xinhua Chen; Sitao Huang; Cong Guo; Chiyue Wei; Yintao He; Jianyi Zhang; Hai "Helen" Li; Yiran Chen

arXiv:2508.14148·cs.CL·August 26, 2025

Open Access · 3 Reviews

TL;DR

DPad is a training-free method that reduces the computational cost of diffusion-based language models by limiting attention to nearby suffix tokens, achieving significant speedups without sacrificing accuracy.

Contribution

DPad introduces a simple, effective, training-free approach combining a sliding window and distance-decay dropout to improve efficiency of diffusion language models.

Findings

01

Up to 61.4x speedup over vanilla dLLMs

02

Maintains comparable accuracy with reduced computation

03

Compatible with existing optimization techniques

Abstract

Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad…
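The two strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the parameters `window`, `sigma`, and `seed`, and the Gaussian form of the keep-probability are all assumptions made for illustration (a reviewer below mentions a Gaussian sampler, but its exact form is not given here). The sketch picks which suffix positions would remain visible to attention; a fixed seed makes the selection reproducible.

```python
import math
import random


def select_suffix_positions(block_end, seq_len, window=128, sigma=48.0, seed=0):
    """Return the suffix positions kept for attention (illustrative only).

    block_end: index of the first suffix token (one past the current block).
    seq_len:   total sequence length (prefix + current block + suffix).
    window:    hypothetical maximum suffix distance that may be kept.
    sigma:     hypothetical width of the Gaussian keep-probability.
    seed:      hypothetical fixed seed so the same positions are kept each call.
    """
    rng = random.Random(seed)
    kept = []
    # Sliding window: positions beyond block_end + window are never kept.
    last = min(seq_len, block_end + window)
    for pos in range(block_end, last):
        distance = pos - block_end
        # Distance decay: the keep-probability shrinks as distance grows.
        keep_prob = math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
        if rng.random() < keep_prob:
            kept.append(pos)
    return kept


if __name__ == "__main__":
    kept = select_suffix_positions(block_end=32, seq_len=1024)
    print(f"kept {len(kept)} of {1024 - 32} suffix positions")
```

Under these assumptions, attention for the current block would be computed over the prefix, the block itself, and only the returned suffix positions, so the cost no longer grows with the full suffix length.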

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Clean analysis of what happens with suffix tokens in block-wise autoregressive generation with MDLMs.
- Well-motivated method for reducing unnecessary computation that can be applied at inference time without changes to the model.
- Strong improvements in efficiency, and good ablations showing that the Gaussian sampler, which reflects the decay in attention scores, actually helps.

Weaknesses

See questions.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The suffix window + decay dropout is lightweight, does not touch weights, and composes with parallel decoding and caching; the large compounding speedups in long sequences are compelling.
2. Attention/entropy analyses show strong distance decay and near-zero entropy far in the suffix; ablations identify a 64–128 token “critical window.”
3. Empirical studies cover both reasoning and code tasks, and the tables separate latency from TPS and discuss why TPS can look small

Weaknesses

1. Recent training-free accelerations for dLLMs cover Block-dLLM, Fast-dLLM (dLLM cache), Sparse-dLLM, etc. The paper cites some of them but does not always compare directly, especially on matched long-sequence regimes and memory usage.
2. The “distance-decay” sparsity concept parallels existing efforts in distance-biased or entropy-guided pruning methods. While DPad’s pre-attention pruning is a nice twist, the paper should sharpen how this differs theoretically/empirically from Sparse-dLLM (e

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper conducts a detailed analysis of suffix tokens in diffusion language models from a sequence-level perspective. It introduces a novel viewpoint for understanding suffix tokens and identifies three key properties.
2. The proposed **DPad** method is training-free; by applying distance-decay-style suffix dropout during inference, it significantly reduces computational complexity while maintaining model accuracy.

Weaknesses

1. Section 3.1 formally introduces the *scratchpad mechanism* and explains, from the perspective of attention-based information flow, that *‘the suffix serves as temporary memory to assist the ongoing denoising process’*. While this section highlights the potential *‘Attention Connection’* role of suffix tokens, it does not evaluate how critical this effect is to model inference. No intervention experiments (e.g., masking parts of the suffix and observing changes in hidden states) are provided t

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis