ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping
Zijian Zhu, Fei Ren, Zhanhong Tan, Kaisheng Ma

TL;DR
This paper introduces ES-dLLM, a training-free method that accelerates diffusion large language model inference by early token skipping based on intermediate tensor variation, achieving significant speedups without quality loss.
Contribution
The paper proposes a novel inference acceleration framework for dLLMs that leverages subtle changes in intermediate representations to skip tokens early, reducing computation.
Findings
Achieves up to 16.8x speedup over vanilla dLLM inference.
Maintains generation quality while significantly increasing throughput.
Outperforms state-of-the-art caching methods in speed.
Abstract
Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and the potential for parallel generation. Despite the advantages, dLLM inference remains computationally expensive as the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLMs that reduces computation by skipping tokens in early layers based on their estimated importance. Token importance is computed from intermediate tensor variation and the confidence scores of previous iterations. Experiments on LLaDA-8B and Dream-7B demonstrate…
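The importance score described in the abstract can be sketched as a linear blend of two per-token signals. The following is a minimal illustration, not the authors' implementation: the weight `alpha` and its default of 0.5 follow the reviewer summary on this page, while the function names, the max-normalization of the variation term, the `keep_ratio` parameter, and the convention that higher-scoring tokens are the ones kept are all assumptions made for this example.

```python
import math
from typing import List


def importance_scores(prev_confidence: List[float],
                      variation: List[float],
                      alpha: float = 0.5) -> List[float]:
    """Blend per-token confidence and tensor variation into one score.

    score_i = alpha * confidence_i + (1 - alpha) * variation_i
    alpha = 1 uses confidence only; alpha = 0 uses variation only.
    """
    if len(prev_confidence) != len(variation):
        raise ValueError("inputs must have one entry per token")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Assumption: rescale variation by its maximum so both terms share
    # a [0, 1] range. The paper may normalize differently.
    peak = max(variation, default=0.0)
    scaled = [v / peak if peak > 0 else 0.0 for v in variation]

    return [alpha * c + (1.0 - alpha) * v
            for c, v in zip(prev_confidence, scaled)]


def select_tokens_to_keep(scores: List[float], keep_ratio: float) -> List[int]:
    """Return the (sorted) indices of the highest-scoring tokens.

    Assumption: tokens outside this set would be skipped in later
    computation and served from cached states instead.
    """
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage: three tokens, hypothetical confidence and variation values.
scores = importance_scores([0.9, 0.1, 0.5], [0.0, 2.0, 1.0], alpha=0.5)
kept = select_tokens_to_keep(scores, keep_ratio=0.5)
```

In this toy input the second token has low confidence but the largest variation, so it still ranks first at `alpha=0.5`. Because the blend depends on a single fixed weight, changing `alpha` alone can change which tokens are retained.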
Peer Reviews
Decision · ICLR 2026 Poster
- This work proposes a method well substantiated by empirical findings.
- Consistent and significant speedup across all the datasets tested.
- The experiments section compares against only two other methods. Are there any other methods addressing the same problem?
+ **Training-free design for easy deployment**: ES-dLLM requires no model fine-tuning or structural modification. It only optimizes the inference process through early-skipping and partial cache updates, which can be directly integrated into open-source dLLMs (e.g., LLaDA, Dream) without reconstructing the underlying framework.
+ **Fixed heuristic for importance score, lacking adaptability**: The importance score of ES-dLLM relies on a fixed linear weighting of prior confidence and intermediate tensor variation (default α=0.5), without considering task-specific differences. Ablation experiments show that using only tensor variation (α=0) performs better on the MATH dataset than the default α=0.5, while relying solely on confidence (α=1) leads to noticeable quality degradation. However, ES-dLLM does not propose a dynamic
The motivation is strong: the hidden-state variation statistics convincingly demonstrate redundancy. We believe this has been observed in diffusion video generation models and image generation models [1], but it is new for dLLMs. This paper clearly demonstrates its motivation. I like that. Not compromising generation quality is important for real-world applications, which helps a lot for this paper. We believe training-free is essential for effective inference, which this paper achieves. [1] Sil
Novelty risk: This method appears to be an extension of the original implementation and DualCache. Fortunately, DualCache is not a sparsity work, which distinguishes the two works, but we hope to see more discussion on the degree of differentiation. Also, as we claimed in the Strengths, a similar idea has been proposed in the traditional diffusion model. Comparison and related work: The paper compares only with the original method and DualCache, limiting the scope of the comparison. However, t
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Generative Adversarial Networks and Image Synthesis
