Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Chengyue Wu; Hao Zhang; Shuchen Xue; Zhijian Liu; Shizhe Diao; Ligeng Zhu; Ping Luo; Song Han; Enze Xie

arXiv:2505.22618·cs.CL·July 4, 2025

Open Access · 3 Reviews

TL;DR

Fast-dLLM introduces a training-free method to accelerate diffusion-based large language models by enabling KV cache reuse and confidence-aware parallel decoding, significantly improving throughput while maintaining quality.

Contribution

The paper proposes a block-wise approximate KV Cache and confidence-aware parallel decoding to speed up inference for diffusion LLMs without retraining.
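The caching idea can be illustrated with a toy cost model. The sketch below is not the authors' implementation: it assumes that generation proceeds block by block, that key/value entries for every token outside the active block are computed once per block and then reused across that block's denoising steps, and that only the active block is recomputed at each step. The function `count_kv_calls` and all of its parameters are hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple

KV = Tuple[float, float]  # toy stand-in for a (key, value) pair


def count_kv_calls(
    prompt_len: int,
    gen_len: int,
    block_size: int,
    steps_per_block: int,
    use_cache: bool,
) -> int:
    """Count per-token KV computations under a toy block-wise schedule."""
    total = prompt_len + gen_len
    calls = 0

    def compute_kv(pos: int, step: int) -> KV:
        # Placeholder for the real projection; only the call count matters here.
        nonlocal calls
        calls += 1
        return (float(pos), float(step))

    for start in range(prompt_len, total, block_size):
        end = min(start + block_size, total)
        cache: Dict[int, KV] = {}
        if use_cache:
            # Assumption: refresh once per block for tokens outside the block.
            cache = {
                p: compute_kv(p, 0)
                for p in range(total)
                if not start <= p < end
            }
        for step in range(steps_per_block):
            # Reuse cached entries; recompute only what is missing.
            context = [
                cache[p] if p in cache else compute_kv(p, step)
                for p in range(total)
            ]
            assert len(context) == total
    return calls


baseline = count_kv_calls(4, 4, block_size=2, steps_per_block=2, use_cache=False)
cached = count_kv_calls(4, 4, block_size=2, steps_per_block=2, use_cache=True)
```

In this toy setting the saving grows with the number of denoising steps per block, because the one-off refresh is amortized over more steps. The cache is approximate because the reused entries are not updated as tokens inside the block change.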

Findings

01. Achieves up to 27.6× throughput improvement.

02. Maintains high generation quality with minimal accuracy loss.

03. Bridges the inference-speed gap between diffusion and autoregressive models.

Abstract

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency…
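A minimal sketch of the confidence-threshold selection described in the abstract, under stated assumptions: the model is abstracted away as a dictionary of per-position probability distributions, `MASK` is a hypothetical sentinel for undecoded positions, and the fallback of unmasking the single most confident position when nothing clears the threshold is an assumption added here so the loop always makes progress. None of the names below come from the paper's code.

```python
from typing import Dict, List, Sequence, Tuple

MASK = -1  # hypothetical sentinel for a still-masked position


def confidence_parallel_step(
    tokens: List[int],
    probs: Dict[int, Sequence[float]],
    threshold: float = 0.9,
) -> List[int]:
    """Unmask every masked position whose top probability meets the threshold.

    `probs` maps a position to a distribution over a toy vocabulary.
    """
    candidates: Dict[int, Tuple[float, int]] = {}
    for pos, dist in probs.items():
        if tokens[pos] != MASK:
            continue  # already decoded, nothing to do
        best_id = max(range(len(dist)), key=dist.__getitem__)
        candidates[pos] = (dist[best_id], best_id)

    if not candidates:
        return list(tokens)

    chosen = [p for p, (conf, _) in candidates.items() if conf >= threshold]
    if not chosen:
        # Assumed fallback: decode the single most confident position.
        chosen = [max(candidates, key=lambda p: candidates[p][0])]

    out = list(tokens)
    for pos in chosen:
        out[pos] = candidates[pos][1]
    return out


step1 = confidence_parallel_step(
    [5, MASK, MASK, MASK],
    {1: [0.05, 0.95], 2: [0.6, 0.4], 3: [0.02, 0.98]},
)
step2 = confidence_parallel_step(step1, {2: [0.6, 0.4]})
```

In the first call, positions 1 and 3 clear the 0.9 threshold and are decoded together, while position 2 stays masked. In the second call nothing clears the threshold, so the fallback decodes position 2 on its own.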

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Timely and practically relevant: addresses the major efficiency bottleneck of diffusion-based LLMs, a direction gaining interest.
2. Training-free approach: requires no retraining or model modification, making it directly applicable to existing dLLMs.
3. Comprehensive evaluation: covers both text and multimodal reasoning tasks, with consistent gains across benchmarks.
4. Strong empirical results: large acceleration factors (up to 27.6×) with small degradation make the method attractive for

Weaknesses

1. Applicability to distilled or few-step diffusion LLMs. It remains unclear whether the proposed caching and confidence-aware decoding strategies would remain effective for distilled diffusion LLMs that operate with only a few or even a single denoising step (e.g., dParallel, arXiv:2509.26488; One-Step Diffusion LLM, OpenReview:P7OzWxOUHK). The reviewer acknowledges that these are concurrent works, but such aggressive timestep reduction is becoming a key trend, similar to continuous diffusion

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. Novel adaptation of KV Cache to bidirectional diffusion models via block-wise approximation, with insightful analysis showing high cosine similarity between adjacent steps.
2. Theoretical foundation with Theorem 1 proving the equivalence between greedy parallel and sequential decoding under certain conditions.
3. Comprehensive ablation studies covering key hyperparameters (block sizes, thresholds, generation lengths) and evaluation across models and benchmarks.
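The cosine-similarity observation mentioned in the first strength can be illustrated with a small helper. This is only a sketch: the two vectors are made-up stand-ins for one token's key (or value) activation at two adjacent denoising steps, not measurements from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical activations for the same token at steps t and t + 1.
kv_step_t = [1.0, 2.0, 3.0]
kv_step_t_plus_1 = [1.1, 2.0, 2.9]
similarity = cosine_similarity(kv_step_t, kv_step_t_plus_1)
```

A value close to 1 means the activation barely changed between steps, which is the condition under which reusing a cached entry is a reasonable approximation.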

Weaknesses

The evaluation relies on only four benchmarks (GSM8K, MATH, HumanEval, and MBPP for LLaDA), which focus on math reasoning and code generation. It omits important capability dimensions such as commonsense reasoning (HellaSwag), factual knowledge retrieval (TriviaQA), and real-world code generation (LiveCodeBench, BigCodeBench), which would give a fuller picture of the method's generalization and potential failure modes across task types.

Reviewer 03 · Rating 8 · Confidence 5

Strengths

1. The paper introduces a novel, training-free framework that tackles two foundational challenges in dLLM inference.
2. The proposed methods are well-justified and supported by solid theory and experimental observations. The approximate KV cache is empirically validated by the high similarity of KV activations in adjacent steps. The parallel decoding strategy is theoretically supported by Theorem 1, which proves the equivalence of greedy parallel and sequential decoding under high-confidence conditio

Weaknesses

The experimental evaluation is primarily limited to comparisons against the baseline dLLM pipelines. The paper lacks a comparison to other existing acceleration techniques for Diffusion LLMs.

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion