dKV-Cache: The Cache for Diffusion Language Models

Xinyin Ma; Runpeng Yu; Gongfan Fang; Xinchao Wang

arXiv:2505.15781·cs.CL·May 22, 2025

Open Access · 2 Repos

TL;DR

This paper introduces dKV-Cache, a caching mechanism that speeds up inference in diffusion language models by 2-10x, narrowing the gap with autoregressive models without requiring retraining.

Contribution

The paper proposes a delayed and conditioned KV-cache-like mechanism for the denoising process of DLMs. It brings key-value caching to non-autoregressive models, speeding up inference and, on long sequences, even improving performance.

Findings

01 Achieves 2-10x inference speedup across benchmarks.

02 Provides almost lossless acceleration with improved long-sequence performance.

03 Enables training-free cache utilization in existing DLMs.

Abstract

Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that…
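To make the idea of a delayed cache concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class name `DelayedKVCache`, the `delay` parameter, and the rule "freeze a token's key and value once it has been decoded for at least `delay` denoising steps" are assumptions made for illustration, and key/value states are represented as plain lists of floats for a single attention layer.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


class DelayedKVCache:
    """Toy delayed key/value cache for one attention layer (hypothetical helper)."""

    def __init__(self, seq_len: int, delay: int = 1) -> None:
        self.delay = delay  # assumed delay, measured in denoising steps
        self.keys: List[Optional[Vector]] = [None] * seq_len
        self.values: List[Optional[Vector]] = [None] * seq_len
        # Step at which each position was decoded (None while still masked).
        self.decoded_step: List[Optional[int]] = [None] * seq_len

    def mark_decoded(self, positions: Sequence[int], step: int) -> None:
        """Record that `positions` were decoded at `step` (first time only)."""
        for pos in positions:
            if self.decoded_step[pos] is None:
                self.decoded_step[pos] = step

    def is_cached(self, pos: int) -> bool:
        return self.keys[pos] is not None

    def merge(
        self, step: int, new_k: Sequence[Vector], new_v: Sequence[Vector]
    ) -> Tuple[List[Vector], List[Vector]]:
        """Freeze K/V of tokens decoded at least `delay` steps ago, then return
        K/V that reuse frozen entries and take all others from new_k/new_v."""
        out_k: List[Vector] = []
        out_v: List[Vector] = []
        for pos, (k, v) in enumerate(zip(new_k, new_v)):
            decoded = self.decoded_step[pos]
            ready = decoded is not None and step - decoded >= self.delay
            if ready and not self.is_cached(pos):
                self.keys[pos], self.values[pos] = list(k), list(v)
            if self.is_cached(pos):
                out_k.append(self.keys[pos])
                out_v.append(self.values[pos])
            else:
                out_k.append(list(k))
                out_v.append(list(v))
        return out_k, out_v


# Usage: position 1 is decoded at step 0, frozen at step 1, reused at step 2.
cache = DelayedKVCache(seq_len=3, delay=1)
cache.mark_decoded([1], step=0)
cache.merge(0, [[1.0]] * 3, [[1.0]] * 3)
cache.merge(1, [[2.0]] * 3, [[2.0]] * 3)
k, v = cache.merge(2, [[3.0]] * 3, [[3.0]] * 3)
```

For simplicity the sketch recomputes keys and values for every position at every step and only chooses which ones to keep. An actual speedup would come from skipping the recomputation for positions that are already cached.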

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · Diffusion