dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang

TL;DR
dLLM-Cache is a novel adaptive caching framework that significantly accelerates diffusion-based large language models by reusing computations, reducing inference latency without sacrificing output quality.
Contribution
It introduces a training-free, adaptive caching method tailored for diffusion LLMs, enabling efficient inference acceleration by leveraging static prompts and partial response stability.
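One way to picture how a static prompt and a mostly stable response could be exploited is a refresh schedule in which prompt features are recomputed rarely and response features more often. The sketch below is illustrative only: `refresh_plan`, `prompt_interval`, and `response_interval` are hypothetical names, and the interval values are assumptions rather than settings taken from the paper.

```python
def refresh_plan(num_steps, prompt_interval, response_interval):
    """Return, for each denoising step, which cached features get a full refresh.

    All names and the modulo-based schedule are assumptions for illustration.
    """
    plan = []
    for step in range(num_steps):
        plan.append({
            "step": step,
            # The prompt is static, so its features are refreshed rarely.
            "refresh_prompt": step % prompt_interval == 0,
            # The response changes slowly, so it is refreshed more often.
            "refresh_response": step % response_interval == 0,
        })
    return plan


plan = refresh_plan(num_steps=8, prompt_interval=4, response_interval=2)
prompt_refreshes = [p["step"] for p in plan if p["refresh_prompt"]]
response_refreshes = [p["step"] for p in plan if p["refresh_response"]]
print(prompt_refreshes, response_refreshes)
```

On steps where neither flag is set, cached features would be reused as-is in this simplified picture.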
Findings
Achieves up to 9.1x speedup over standard inference.
Brings dLLM inference latency close to that of autoregressive models.
Maintains output quality while accelerating inference.
Abstract
Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature…
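The partial response update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract is cut off before it names the selection criterion, so the cosine similarity measure, the `update_ratio` parameter, and every function name here are assumptions. The idea shown is to compare each response token's freshly computed feature vector with its cached one and to refresh only the tokens that changed the most.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tokens_to_update(cached, current, update_ratio):
    """Pick indices of the tokens whose features drifted most from the cache.

    `update_ratio` (hypothetical) is the fraction of tokens to refresh.
    """
    sims = [cosine_similarity(c, n) for c, n in zip(cached, current)]
    k = max(1, math.ceil(update_ratio * len(sims)))
    # Lowest similarity first: these tokens changed the most.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(ranked[:k])


def partial_update(cache, fresh, indices):
    """Return a copy of the cache with only the selected rows replaced."""
    updated = [row[:] for row in cache]
    for i in indices:
        updated[i] = fresh[i][:]
    return updated


# Toy features for four response tokens; token 1 changes direction entirely.
cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
current = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0], [2.0, 0.0]]

chosen = select_tokens_to_update(cached, current, update_ratio=0.25)
new_cache = partial_update(cached, current, chosen)
print(chosen, new_cache)
```

In this toy example only token 1 is refreshed, while the other three cached entries are reused unchanged.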
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper is clearly written and well-structured, making the method easy to understand.
2. The adaptive caching mechanism is simple yet effective, requiring no retraining or architectural modification.
3. Comprehensive experiments demonstrate strong empirical improvements across multiple diffusion-based LLMs.
4. Ablation studies support the robustness of the cache selection mechanism.
1. The theoretical complexity reduction is not analyzed in sufficient depth. A more formal discussion or derivation of the speed–accuracy trade-off would greatly enhance clarity.
2. Since dLLM-Cache reuses outdated KV pairs, it would strengthen the work to mathematically and empirically analyze the upper bound of the approximation error under different values of K.
3. The limitations of the proposed method, particularly under dynamic or semantically diverse prompts, are not fully explored. A dee
1. The proposed V-Verify module effectively identifies and selects tokens for caching.
2. The method is well-motivated and demonstrates strong performance in accelerating dLLMs.
1. Contribution is limited. The paper claims two main contributions: (i) adopting different update intervals for the prompt and response, and (ii) introducing V-verify to identify the most changed tokens for partial updates. However, the first contribution, also emphasized in the main analysis experiments (Section 3.2), has already been explored in prior works on KV-Cache for diffusion LLMs, such as dKV-Cache (the prefill part) and Fast-dLLM (PrefixCache). The other contribution is the V-verify
1. The paper addresses a critical and widely recognized problem: the high inference cost of diffusion LLMs. The achieved speedups are substantial and represent a major step towards making these models practical for real-world applications.
2. The V-verify mechanism is a clever, lightweight, and empirically-grounded solution for adaptively selecting which tokens to update, avoiding the overhead of more complex methods.
3. A key strength is that dLLM-Cache does not require any model retraining. T
1. The high-level diagrams provide a helpful overview of the caching workflow. However, the specific structure of the cache, particularly whether it is a single global entity or maintained on a per-layer basis, is not explicitly detailed.
2. The paper could better justify the necessity of caching four feature types (K, V, AttnOut, FFNOut), as this is a departure from the more standard KV-caching in ARMs.
3. It is unclear why some configurations, such as MMLU with LLaDA Instruct, benefit less fro
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Machine Learning in Healthcare
