Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits
Ashish Khisti, M.Reza Ebrahimi, Hassan Dbouk, Arash Behboodi, Roland Memisevic, Christos Louizos

TL;DR
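This paper develops a theoretical framework for multi-draft speculative sampling. It shows that the optimal token-level selection scheme decomposes into an importance sampling step followed by single-draft speculative sampling, and it derives conditions and explicit formulas for the acceptance probability when the draft models are identical.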
Contribution
The paper presents a novel decomposition of the optimal multi-draft sampling scheme and provides theoretical conditions and explicit formulas for acceptance probabilities.
Findings
Decomposition into importance sampling and single-draft steps improves sampling efficiency.
The paper derives explicit acceptance probability formulas for the case of identical draft models.
Experimental results show increased block efficiency and token rates.
Abstract
We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step an importance sampling (IS) type scheme is used to select one intermediate token; in the second step (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target…
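The second step of the decomposition described in the abstract is ordinary single-draft speculative sampling. The following is a minimal Python sketch of one such token-level step, assuming the standard accept/reject rule: a token x drawn from the draft distribution is accepted with probability min(1, target[x] / draft[x]), and on rejection a token is resampled from the normalized residual max(target - draft, 0). The function names `speculative_step` and `acceptance_probability` and the list-based distributions are illustrative assumptions, not taken from the paper, and the importance sampling step that selects the intermediate token is not reproduced here.

```python
import random


def acceptance_probability(draft, target):
    """Probability that a single-draft step accepts the proposed token.

    Under the standard rule this equals sum_x min(draft[x], target[x]).
    """
    return sum(min(d, t) for d, t in zip(draft, target))


def speculative_step(draft, target, rng):
    """One token-level single-draft speculative sampling step (sketch).

    draft, target: lists of probabilities over the same vocabulary.
    rng: a random.Random instance.
    Returns (token, accepted).
    """
    vocab = range(len(draft))
    # Propose a token from the draft distribution.
    x = rng.choices(vocab, weights=draft, k=1)[0]
    # Accept with probability min(1, target[x] / draft[x]).
    if rng.random() < min(1.0, target[x] / draft[x]):
        return x, True
    # Rejection implies target[x] < draft[x], so the residual has positive mass.
    residual = [max(t - d, 0.0) for d, t in zip(draft, target)]
    y = rng.choices(vocab, weights=residual, k=1)[0]
    return y, False
```

With this rule the output token follows the target distribution regardless of the draft, while the acceptance probability depends on how closely the draft matches the target. The multi-draft setting studied in the paper asks how much that acceptance probability can be raised when several draft tokens are available.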
Peer Reviews
Decision: ICLR 2025 Spotlight
The results in the paper are quite original and significant, to my knowledge, which is a bit limited. The experiments show that the proposed algorithm works well in practice.
I would have liked the presentation in Section 4 to be a little clearer. Please write out the full linear program and then explain the truncation in more detail: what is the truncated program computing? In the experiment section, please include results for temperature < 1, which is much more common in practice.
The paper is well written and easy to follow. Theorems 2 and 3 are novel and the experimental results are convincing.
See questions section below.
This paper provides a strong theoretical contribution to a very important topic. Speculative decoding serves as a refreshing and novel direction for accelerating LLM inference. This work puts the multi-draft model selection problem on better theoretical footing. The results seem to be sound from my reading and no obvious errors were found.
By and large, the biggest complaint I have with this draft is the background exposition. Prior to reading this I was not familiar with speculative decoding. While I understand space is tight, many key concepts, such as the role and form of the acceptance probability in the speculative decoder, were not clearly explained. This is a *key* aspect of the work, and the draft would strongly benefit from giving it more treatment. To be honest, I couldn't understand anything the first time I read the paper and
Taxonomy
Topics: Data Analysis with R · Statistical and Computational Modeling · Time Series Analysis and Forecasting
