Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks
Sergey Pankratov, Dan Alistarh

TL;DR
This paper establishes fundamental limits on how fast speculative decoding can run in large language models by modeling token generation as a branching random walk, and it validates the theoretical bounds empirically.
Contribution
It introduces the first tight lower bounds on the runtime of deterministic speculative generation algorithms, derived through branching random walk analysis, to guide future system design.
Findings
The expected number of tokens successfully predicted per speculative iteration is bounded by a function of the verifier's capacity and the entropy of its output distribution.
Theoretical bounds are validated by empirical experiments on Llama models.
Results reveal fundamental limits on parallel token generation efficiency.
Abstract
Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded by an expression in the verifier's capacity, the expected entropy of the verifier's output distribution, and…
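The draft tree selection problem mentioned in the abstract can be illustrated with a small sketch. It is not the paper's algorithm or proof. It rests on toy assumptions that are not in the source: every position shares the same next-token distribution `probs`, the drafter knows that distribution exactly, and verification accepts the drafted path for as long as it matches a sequence sampled from the verifier. Under those assumptions the expected number of accepted tokens equals the sum of the path probabilities of the tree's nodes, so the best tree with `capacity` nodes keeps the `capacity` most probable paths. The function name `best_draft_tree_value` is hypothetical.

```python
import heapq


def best_draft_tree_value(probs, capacity):
    """Expected accepted tokens for the best draft tree with `capacity` nodes.

    Toy assumptions (illustrative only, not taken from the paper):
    - every position has the same next-token distribution `probs`;
    - the drafter knows this distribution exactly;
    - a node counts as accepted when the sampled sequence passes through it.
    """
    # Max-heap of candidate nodes keyed by path probability.
    # heapq is a min-heap, so probabilities are stored negated.
    heap = [-p for p in probs if p > 0]
    heapq.heapify(heap)

    expected = 0.0
    for _ in range(capacity):
        if not heap:
            break
        neg_path_prob = heapq.heappop(heap)
        # Adding a node contributes its path probability to the expectation.
        expected += -neg_path_prob
        # The node's children become candidates; a child is never more
        # probable than its parent, so the selected set stays a valid tree.
        for q in probs:
            if q > 0:
                heapq.heappush(heap, neg_path_prob * q)
    return expected


if __name__ == "__main__":
    fair_coin = [0.5, 0.5]
    for capacity in (2, 6, 14, 30):
        print(capacity, best_draft_tree_value(fair_coin, capacity))
```

For the fair-coin distribution each complete tree level adds exactly one expected token while doubling the number of nodes, so the value grows only logarithmically in the capacity in this toy setting.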
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning and Algorithms · Generative Adversarial Networks and Image Synthesis
