Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie

TL;DR
Fast-dLLM introduces a training-free method to accelerate diffusion-based large language models by enabling KV cache reuse and confidence-aware parallel decoding, significantly improving throughput while maintaining quality.
Contribution
It proposes a novel block-wise approximate KV Cache and confidence-aware decoding to enhance inference speed of diffusion LLMs without retraining.
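To give a rough sense of why block-wise cache reuse helps, the sketch below counts per-token key/value computations with and without a cache. It is an illustrative cost model, not the paper's implementation or its measurements: the function name, its parameters, and the assumption that only the prefix (prompt plus finished blocks) is cached and refreshed once per block are all hypothetical.

```python
def count_kv_computations(prompt_len, num_blocks, block_size,
                          steps_per_block, use_cache):
    """Count per-token KV computations in a toy block-wise decoding loop.

    Assumption (not from the source): with caching, the prefix KV is
    recomputed once at each block boundary and reused for every denoising
    step inside that block; only the current block is recomputed per step.
    """
    seq_len = prompt_len  # tokens already fixed (prompt + finished blocks)
    total = 0
    for _ in range(num_blocks):
        if use_cache:
            # Refresh the prefix cache once at the block boundary.
            total += seq_len
        for _ in range(steps_per_block):
            if use_cache:
                # Reuse cached prefix KV; recompute only the active block.
                total += block_size
            else:
                # No cache: recompute KV for prefix and block at every step.
                total += seq_len + block_size
        seq_len += block_size  # the finished block joins the prefix
    return total


with_cache = count_kv_computations(4, 2, 2, 2, use_cache=True)
without_cache = count_kv_computations(4, 2, 2, 2, use_cache=False)
print(with_cache, without_cache)
```

The cache is "approximate" because, in a bidirectional model, the prefix activations would in principle change as tokens in the current block are unmasked; this toy count ignores that effect and only shows where the savings come from.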
Findings
Achieves up to 27.6× throughput improvement.
Maintains high generation quality with minimal accuracy loss.
Bridges the performance gap between diffusion and autoregressive models.
Abstract
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency…
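The confidence-aware strategy described in the abstract can be sketched as a single decoding step that unmasks only positions whose top probability exceeds a threshold. This is a minimal illustration under stated assumptions, not the authors' code: `MASK`, `confidence_decode_step`, the threshold value of 0.9, and the fallback of unmasking the single most confident position when nothing clears the threshold are all hypothetical choices made for this example.

```python
import math

MASK = None  # hypothetical marker for a position that is still masked


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_decode_step(tokens, logits_per_pos, threshold=0.9):
    """Unmask every masked position whose top probability exceeds `threshold`.

    Assumption: if no masked position clears the threshold, the single most
    confident one is unmasked so that decoding always makes progress.
    """
    candidates = []
    for i, tok in enumerate(tokens):
        if tok is MASK:
            probs = softmax(logits_per_pos[i])
            best = max(range(len(probs)), key=probs.__getitem__)
            candidates.append((probs[best], i, best))
    if not candidates:
        return list(tokens)  # nothing left to decode
    accepted = [c for c in candidates if c[0] > threshold]
    if not accepted:
        accepted = [max(candidates)]  # fallback: most confident position
    out = list(tokens)
    for _, i, best in accepted:
        out[i] = best
    return out


# Toy logits over a 3-token vocabulary for a 4-position sequence.
tokens = [5, MASK, MASK, MASK]
logits = [
    [0.0, 0.0, 0.0],   # position 0 is already decoded, so this row is unused
    [0.0, 10.0, 0.0],  # very confident in token 1
    [1.0, 1.0, 1.0],   # uniform, low confidence
    [0.0, 0.0, 3.0],   # confident in token 2
]
step1 = confidence_decode_step(tokens, logits)
step2 = confidence_decode_step(step1, logits)
print(step1, step2)
```

In a real model the logits would be recomputed after each step, since newly unmasked tokens change the predictions for the remaining masked positions; the example reuses the same logits only to keep the sketch self-contained.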
Peer Reviews
Decision·ICLR 2026 Poster
1. Timely and practically relevant: addresses the major efficiency bottleneck of diffusion-based LLMs, a direction gaining interest. 2. Training-free approach: requires no retraining or model modification, making it directly applicable to existing dLLMs. 3. Comprehensive evaluation: covers both text and multimodal reasoning tasks, with consistent gains across benchmarks. 4. Strong empirical results: large acceleration factors (up to 27.6x) with small degradation make the method attractive for
1. Applicability to distilled or few-step diffusion LLMs. It remains unclear whether the proposed caching and confidence-aware decoding strategies would remain effective for distilled diffusion LLMs that operate with only a few or even a single denoising step (e.g., dParallel, arXiv:2509.26488; One-Step Diffusion LLM, OpenReview:P7OzWxOUHK). The reviewer acknowledges that these are concurrent works, but such aggressive timestep reduction is becoming a key trend, similar to continuous diffusion
1. Novel adaptation of KV Cache to bidirectional diffusion models via block-wise approximation, with insightful analysis showing high cosine similarity between adjacent steps. 2. Theoretical foundation with Theorem 1 proving the equivalence between greedy parallel and sequential decoding under certain conditions. 3. Comprehensive ablation studies covering key hyperparameters (block sizes, thresholds, generation lengths) and evaluation across models and benchmarks.
The evaluation of models relies on only four benchmarks (GSM8K, MATH, HumanEval, MBPP for LLaDA), which primarily focus on math reasoning and code generation, missing important capability dimensions like commonsense reasoning (HellaSwag), factual knowledge retrieval (TriviaQA), and real-world code generation (LiveCodeBench, BigCodeBench) that would provide a more comprehensive understanding of the method's generalization and potential failure modes across diverse task types.
1. The paper introduces a novel, training-free framework that tackles two foundational challenges in dLLM inference. 2. The proposed methods are well justified and supported by solid theory and experimental observations. The approximate KV cache is empirically validated by the high similarity of KV activations in adjacent steps. The parallel decoding strategy is theoretically supported by Theorem 1, which proves the equivalence of greedy parallel and sequential decoding under high-confidence conditio
The experimental evaluation is primarily limited to comparisons against the baseline dLLM pipelines. The paper lacks a comparison to other existing acceleration techniques for Diffusion LLMs.
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
