FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion
Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

TL;DR
This paper introduces FlashDLM, a method that significantly accelerates diffusion language model inference through KV caching and guided diffusion, reaching latency comparable to or lower than that of autoregressive models with minimal quality loss.
Contribution
The paper presents two training-free techniques, FreeCache and Guided Diffusion, to reduce inference time and complexity of diffusion language models without sacrificing quality.
Findings
Achieves 12.14x speedup in inference
Accelerated diffusion models match or beat AR model latency
Minimal accuracy degradation observed
Abstract
Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized autoregressive (AR) models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique…
Peer Reviews
Decision: ICLR 2026 Poster
### **Strengths of the Proposed Method**
1. **Training-Free Design for Off-the-Shelf Deployment** The core techniques (FreeCache and Guided Diffusion) require no additional training, fine-tuning, or dedicated calibration runs; they directly accelerate pre-trained DLMs "off the shelf". This avoids the overhead of retraining large models and significantly lowers the barrier to practical adoption.
2. **FreeCache: Targeted KV Caching to Cut Redundant Computation** FreeCache leverages a key i
### **Weaknesses and Restrictions**
1. **Limited Validation Across DLM Architectures and Scales** The method’s experiments are restricted to only two DLMs: Dream-7B-Instruct and LLaDA-8B-Instruct. It lacks testing on larger DLM variants (e.g., 13B/34B parameter DLMs) or domain-specific DLMs, making it unclear whether FreeCache’s KV stability assumption or Guided Diffusion’s AR supervision generalizes to more complex or specialized DLM structures.
2. **Sensitivity to the Quality and Domain
1) The method is training-free.
2) The experiments connect the stability observation to a concrete algorithm and show clear improvements in latency with small or no drops in accuracy.
3) The paper motivates FreeCache with a figure that shows stability across steps and positions. The description of block partitioning, active window recomputation, and progressive freezing is easy to follow.
4) The work targets a key bottleneck for DLMs in long-context reasoning: repeated full-sequence passes and p
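The review above names three FreeCache ingredients: block partitioning, active window recomputation, and progressive freezing. The excerpt does not give the algorithm, so the following is only a minimal sketch of how such a schedule might look. It assumes that the prompt's KV entries are frozen up front, that the active window is exactly the current block, and that a block is frozen once it is finished. The names `block_freeze_schedule` and `recompute_fraction` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def block_freeze_schedule(
    prompt_len: int, gen_len: int, block_size: int
) -> List[Tuple[int, int, int]]:
    """Return one (frozen_len, active_start, active_end) tuple per block.

    Assumption (not from the paper): positions [0, frozen_len) are served
    from a KV cache, and only [active_start, active_end) is recomputed.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    schedule = []
    frozen_len = prompt_len  # assumed: prompt KV is cached once
    total_len = prompt_len + gen_len
    while frozen_len < total_len:
        active_end = min(frozen_len + block_size, total_len)
        schedule.append((frozen_len, frozen_len, active_end))
        frozen_len = active_end  # progressive freezing of the finished block
    return schedule


def recompute_fraction(
    schedule: List[Tuple[int, int, int]], total_len: int
) -> float:
    """Fraction of positions recomputed relative to full-sequence passes."""
    if not schedule:
        return 0.0
    recomputed = sum(end - start for _, start, end in schedule)
    return recomputed / (len(schedule) * total_len)


schedule = block_freeze_schedule(prompt_len=4, gen_len=6, block_size=4)
fraction = recompute_fraction(schedule, total_len=10)
```

Under these assumptions, the longer the prompt is relative to the block size, the smaller the recomputed fraction becomes, which is consistent with the reviewers' emphasis on long-context settings.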
1) The study centers on Dream-7B-Instruct and LLaDA-8B-Instruct with reasoning benchmarks. It is unclear how FreeCache scales to much larger DLMs or to domains such as dialogue safety, coding, or open-ended writing, where coherence needs may differ. Adding more domains or larger models would strengthen external validity.
2) Guided Diffusion depends on a longest-prefix acceptance rule with top-k AR logits and a confidence threshold τ. A deeper ablation on k, τ, and different AR guiders would clar
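The longest-prefix acceptance rule mentioned in this review can be illustrated with a short sketch. The excerpt does not specify how the top-k AR logits and the threshold τ combine, so the code assumes that a drafted token is kept only if it appears in the AR guider's top-k set for its position and the DLM's confidence is at least τ; acceptance stops at the first failure. The function name `longest_accepted_prefix` and its inputs are hypothetical.

```python
from typing import Sequence, Set


def longest_accepted_prefix(
    draft_tokens: Sequence[int],
    draft_conf: Sequence[float],
    ar_topk: Sequence[Set[int]],
    tau: float,
) -> int:
    """Return how many leading draft tokens are accepted.

    Assumed rule (not confirmed by the source): a token passes if it is in
    the AR model's top-k set for that position and its DLM confidence is
    at least tau. The first failing position ends the accepted prefix.
    """
    accepted = 0
    for token, conf, topk in zip(draft_tokens, draft_conf, ar_topk):
        if conf < tau or token not in topk:
            break
        accepted += 1
    return accepted


# Toy example: the third drafted token (2) is not in the AR top-k set {4, 6}.
n_accepted = longest_accepted_prefix(
    draft_tokens=[5, 9, 2, 7],
    draft_conf=[0.9, 0.8, 0.95, 0.99],
    ar_topk=[{5, 1}, {9, 3}, {4, 6}, {7, 8}],
    tau=0.5,
)
```

In this toy setup, raising τ or shrinking k can only shorten the accepted prefix, which is why the reviewer's requested ablation over k and τ bears directly on the speed and quality trade-off.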
- The paper is clearly written, with a concise and logical explanation of the motivations, insights, and methodological details.
- The experimental validation is thorough, including both accuracy benchmarks and detailed inference latency measurements. The demonstrated inference speedup is impressive and a strong contribution.
- While the overall inference improvement is significant, a detailed breakdown of the contribution from each component (FreeCache vs. Guided Diffusion) would be beneficial. An ablation study showing the speedup and accuracy impact of each technique individually would help readers understand their relative importance.
- The AR-guided strategy is interesting. However, the analysis could be strengthened by including a discussion or experiment quantifying the inference overhead introduced by the AR
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Diffusion
