DFlash: Block Diffusion for Flash Speculative Decoding
Jian Chen, Yesheng Liang, Zhijian Liu

TL;DR
DFlash introduces a block diffusion-based speculative decoding framework that significantly accelerates large language model inference by enabling parallel draft generation with high quality and acceptance rates.
Contribution
It presents a novel lightweight block diffusion model for speculative decoding, achieving higher speedups and draft quality compared to existing autoregressive methods.
Findings
Over 6x acceleration across various models and tasks
Up to 2.5x higher speedup than EAGLE-3
High-quality draft outputs with increased acceptance rates
Abstract
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient…
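The draft-then-verify loop described in the abstract can be sketched in a few lines. This is a toy illustration, not the DFlash implementation. All names (`speculative_step`, `generate`, `toy_target`, `toy_draft`) are hypothetical, and the "models" are plain Python functions over integer tokens. The draft proposes a whole block in one call, standing in for a parallel block drafter. Verification here uses greedy exact-match acceptance, which is a simplifying assumption. A real target model would score the whole block in one forward pass; this sketch loops over positions for clarity.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]            # greedy next token from the target
DraftFn = Callable[[List[Token], int], List[Token]]  # k draft tokens proposed at once


def speculative_step(prefix: List[Token], target_next: TargetFn,
                     draft_block: DraftFn, block_size: int) -> List[Token]:
    """Draft one block, then keep the longest prefix the target agrees with."""
    draft = draft_block(prefix, block_size)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: emit the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], target_next: TargetFn, draft_block: DraftFn,
             block_size: int, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Run draft-and-verify steps; return the output and the number of steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, target_next, draft_block, block_size))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


def toy_target(prefix: List[Token]) -> Token:
    # Assumed toy "target model": counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[Token], k: int) -> List[Token]:
    # Assumed toy "draft model": agrees with the target on this task.
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out


def autoregressive(prompt: List[Token], target_next: TargetFn, n: int) -> List[Token]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    tokens, steps = generate([0], toy_target, toy_draft, block_size=4, max_new_tokens=12)
    print(tokens, "in", steps, "verification steps")
```

Under greedy verification the output is identical to plain autoregressive decoding with the target, whatever the draft proposes. Draft quality only changes how many tokens each verification step yields, which is why acceptance rate matters for speedup.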
Code & Models
- 🤗 z-lab/Qwen3.5-27B-DFlash · model · 512 dl · ♡ 10
- 🤗 z-lab/Qwen3.5-4B-DFlash · model · 920 dl · ♡ 7
- 🤗 z-lab/Qwen3-Coder-30B-A3B-DFlash · model · 584 dl · ♡ 28
- 🤗 z-lab/Qwen3.5-35B-A3B-DFlash · model · 1.1k dl · ♡ 12
- 🤗 z-lab/Qwen3-8B-DFlash-b16 · model · 8.0k dl · ♡ 20
- 🤗 z-lab/Qwen3-4B-DFlash-b16 · model · 21k dl · ♡ 22
- 🤗 z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat · model · 614 dl · ♡ 2
- 🤗 z-lab/gpt-oss-20b-DFlash · model · 2.0k dl · ♡ 12
- 🤗 z-lab/Qwen3-Coder-Next-DFlash · model · 235 dl · ♡ 5
- 🤗 z-lab/gpt-oss-120b-DFlash · model · 1.7k dl · ♡ 4
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods · Topic Modeling
