DFlash: Block Diffusion for Flash Speculative Decoding

Jian Chen; Yesheng Liang; Zhijian Liu

arXiv:2602.06036·cs.CL·February 6, 2026

TL;DR

DFlash is a speculative decoding framework that drafts tokens in parallel with a lightweight block diffusion model, accelerating large language model inference while keeping draft quality and acceptance rates high.

Contribution

The paper presents a lightweight block diffusion draft model for speculative decoding that achieves higher speedups and better draft quality than existing autoregressive drafting methods.

Findings

01. Over 6x acceleration across various models and tasks
02. Up to 2.5x higher speedup than EAGLE-3
03. High-quality draft outputs with increased acceptance rates

Abstract

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient…
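The draft-then-verify loop described in the abstract can be illustrated with a small sketch. This is a toy model of greedy speculative decoding, not the DFlash implementation: `toy_target_next`, `toy_draft_block`, `verify_block`, and `speculative_decode` are hypothetical names introduced here, tokens are plain integers, and the block drafter is a stand-in for a model that proposes a whole block at once. In a real system the target would score every draft position in one batched forward pass; the per-token loop in `verify_block` is written out only for clarity.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def toy_target_next(seq: Sequence[Token]) -> Token:
    """Hypothetical target model: greedy next token is (last + 1) mod 10."""
    return (seq[-1] + 1) % 10


def toy_draft_block(seq: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical block drafter: proposes k tokens at once.

    It mimics the target but is deliberately wrong whenever the correct
    token would be 7, so that some drafts get rejected.
    """
    block, last = [], seq[-1]
    for _ in range(k):
        last = (last + 1) % 10
        block.append(0 if last == 7 else last)
    return block


def verify_block(
    context: Sequence[Token],
    draft: Sequence[Token],
    target_next: Callable[[Sequence[Token]], Token],
) -> List[Token]:
    """Accept the longest draft prefix matching the target's greedy choices.

    On the first mismatch the target's own token replaces the draft token.
    If every draft token matches, the target adds one extra token.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # correction from the target
            return accepted
        accepted.append(tok)
    accepted.append(target_next(list(context) + accepted))  # bonus token
    return accepted


def speculative_decode(
    prompt: Sequence[Token],
    draft_block: Callable[[Sequence[Token], int], List[Token]],
    target_next: Callable[[Sequence[Token]], Token],
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_block(out, draft, target_next))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


tokens, rounds = speculative_decode(
    [0], toy_draft_block, toy_target_next, block_size=4, max_new_tokens=12
)
```

With these toy models, the 12 new tokens are identical to what the target would produce on its own, and they take 3 verification rounds instead of 12 sequential target steps. The number of rounds depends on how many draft tokens are accepted per block, which is why draft quality and acceptance rate matter for the achievable speedup.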

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods · Topic Modeling