Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks

Sergey Pankratov; Dan Alistarh

arXiv:2512.11718·cs.CL·December 15, 2025

TL;DR

This paper establishes lower bounds on the runtime of speculative decoding in large language models, and hence upper limits on its achievable speedup, by modeling token generation as a branching random walk. Empirical results support the theory.

Contribution

It introduces the first tight lower bounds on the runtime of deterministic speculative generation algorithms, derived via branching random walk analysis, to guide future system design.

Findings

01

The expected number of tokens successfully predicted per speculative iteration is bounded above by a function of the verifier's capacity and the entropy of its output distribution.

02

Experiments on Llama models validate the theoretical bounds.

03

Results reveal fundamental limits on parallel token generation efficiency.

Abstract

Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)}) \log(P) / \mu^{2} + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is…
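To get a feel for how the bound in the abstract scales, the sketch below evaluates only its leading term, $(\mu + \mu_{(2)}) \log(P) / \mu^{2}$. Several things here are assumptions rather than statements from the paper: the function name `leading_term_bound` is hypothetical, the logarithm is taken to be natural, the additive $O(1)$ term is dropped, and the sample values for $\mu$ and $\mu_{(2)}$ are made up for illustration. Because the definition of $\mu_{(2)}$ is cut off in the abstract above, it is treated as an opaque input.

```python
import math


def leading_term_bound(mu: float, mu2: float, capacity: int) -> float:
    """Leading term (mu + mu2) * log(P) / mu**2 of the abstract's bound.

    Assumptions (not confirmed by the source): natural logarithm, and the
    additive O(1) term is omitted, so this is NOT the full bound.
    """
    if mu <= 0:
        raise ValueError("mu (expected entropy) must be positive")
    if capacity < 1:
        raise ValueError("capacity P must be at least 1")
    return (mu + mu2) * math.log(capacity) / mu ** 2


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the scaling in P.
    mu, mu2 = 1.0, 2.0
    for capacity in (8, 64, 512, 4096):
        value = leading_term_bound(mu, mu2, capacity)
        print(f"P={capacity:5d}  leading term = {value:.3f}")
```

Under these assumptions, the leading term grows only logarithmically in the verifier's capacity: squaring $P$ (for example from 8 to 64) merely doubles it.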

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning and Algorithms · Generative Adversarial Networks and Image Synthesis