Speculative Decoding: Performance or Illusion?

Xiaoxuan Liu; Jiaxiang Yu; Jongseok Park; Ion Stoica; Alvin Cheung

arXiv:2601.11580 · cs.CL · March 19, 2026

Open Access

TL;DR

This paper systematically evaluates speculative decoding (SD) for large language models on a production inference engine, revealing performance gaps and opportunities for improvement beyond prior prototype-based assessments.

Contribution

First comprehensive analysis of SD performance on a production-grade LLM inference engine across multiple variants and workloads, identifying key factors and theoretical bounds.

Findings

01 Verification by the target model dominates execution time

02 Acceptance length varies markedly across output token positions, requests, and datasets

03 Measured performance often falls below the theoretical upper bound

Abstract

Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants ($n$-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates the execution, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance with theoretical bounds reveals substantial gaps between…
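The abstract excerpt does not state the paper's own bound, so the sketch below is only an illustration of how acceptance length and drafting cost interact. It uses the commonly cited simplified model in which each drafted token is accepted independently with probability alpha. The names `expected_tokens_per_step`, `idealized_speedup`, and `draft_cost_ratio` are hypothetical and not taken from the paper or from vLLM.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption (not from the paper): each of the k drafted tokens is
    accepted independently with probability alpha, and the target model
    always contributes one token of its own per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the target model's own token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def idealized_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Idealized speedup over plain autoregressive decoding.

    draft_cost_ratio is the assumed cost of one draft step relative to one
    target-model forward pass. Verification is assumed to cost exactly one
    target pass, which ignores batching and scheduling overheads.
    """
    if draft_cost_ratio < 0.0:
        raise ValueError("draft_cost_ratio must be non-negative")
    step_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    # Sweep the draft length for an assumed acceptance rate and draft cost.
    for k in (1, 2, 4, 8):
        print(k, round(idealized_speedup(alpha=0.7, k=k, draft_cost_ratio=0.1), 3))
```

Under these assumptions the speedup drops below 1 when few draft tokens are accepted and drafting is not free, which is one simple way to see why measured gains can fall short of an optimistic estimate.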

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods