Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

Rui Pan; Zhuofu Chen; Hongyi Liu; Arvind Krishnamurthy; Ravi Netravali

arXiv:2512.20573 · cs.LG · January 29, 2026

TL;DR

FailFast uses diffusion LLMs as drafters in speculative decoding and dynamically adapts draft lengths, accelerating autoregressive language models without fine-tuning while reducing verification costs.

Contribution

This paper introduces FailFast, a speculative decoding framework that uses diffusion LLMs as drafters and dynamically adapts its speculation length, achieving up to 4.9× speedup over vanilla decoding without fine-tuning.

Findings

01 Up to 4.9× speedup over vanilla decoding

02 1.7× faster than naive diffusion LLM drafting

03 Effective across diverse models and workloads

Abstract

Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce…
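
The "fail fast, win big" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the confidence-threshold stopping rule, the function names (`adaptive_draft`, `verify`, `speculative_decode`), the parameters (`block`, `max_len`, `threshold`), and the toy drafter and verifier are all assumptions made for illustration. The drafter stands in for a dLLM that proposes a block of tokens at once with a per-token confidence; the verifier stands in for a greedy AR model.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical drafter interface: (prefix, block size) -> (tokens, confidences).
Drafter = Callable[[List[Token], int], Tuple[List[Token], List[float]]]
# Hypothetical verifier interface: prefix -> next token under greedy decoding.
Verifier = Callable[[List[Token]], Token]


def adaptive_draft(prefix: List[Token], drafter: Drafter, block: int = 4,
                   max_len: int = 16, threshold: float = 0.8) -> List[Token]:
    """Grow the draft block by block while the drafter stays confident."""
    draft: List[Token] = []
    while len(draft) < max_len:
        tokens, confs = drafter(prefix + draft, block)
        for tok, conf in zip(tokens, confs):
            if conf < threshold:
                return draft  # fail fast: stop at the first low-confidence token
            draft.append(tok)  # win big: keep extending in easy regions
    return draft[:max_len]


def verify(prefix: List[Token], draft: List[Token],
           verifier: Verifier) -> List[Token]:
    """Accept the longest matching draft prefix, then add one verifier token.

    A real verifier would score all draft positions in one parallel forward
    pass; this sketch calls it token by token for clarity.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = verifier(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the rejected token and stop
            return accepted
        accepted.append(tok)
    accepted.append(verifier(prefix + accepted))  # bonus token after full accept
    return accepted


def speculative_decode(prompt: List[Token], drafter: Drafter,
                       verifier: Verifier, n_new: int) -> Tuple[List[Token], int]:
    """Return the decoded sequence and the number of verification rounds."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        draft = adaptive_draft(out, drafter)
        out.extend(verify(out, draft, verifier))
        rounds += 1
    return out[: len(prompt) + n_new], rounds


# Toy models (assumptions): the target sequence counts upward modulo 10, and
# the drafter is unsure whenever the next token is 0 (a "hard" region).
def toy_verifier(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[Token], k: int) -> Tuple[List[Token], List[float]]:
    tokens, confs, last = [], [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        tokens.append(last)
        confs.append(0.3 if last == 0 else 0.95)
    return tokens, confs


if __name__ == "__main__":
    out, rounds = speculative_decode([1], toy_drafter, toy_verifier, n_new=12)
    print(out, rounds)
```

With these toy models, the 12 new tokens match what token-by-token greedy decoding would produce, but they are obtained in 2 verification rounds instead of 12, because long drafts are accepted in the easy stretches and drafting stops as soon as the drafter's confidence drops.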

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis