Towards Optimal Multi-draft Speculative Decoding
Zhengmian Hu, Tong Zheng, Vignesh Viswanathan, Ziyi Chen, Ryan A., Rossi, Yihan Wu, Dinesh Manocha, Heng Huang

TL;DR
This paper analyzes Multi-Draft Speculative Decoding for large language models, providing a theoretical framework to optimize acceptance rates and comparing different sampling methods to improve decoding efficiency.
Contribution
It introduces an efficient way to compute the optimal acceptance rate using the dual of an optimal transport problem and measures the efficiency gap of existing algorithms.
Findings
Sampling without replacement outperforms sampling with replacement.
Existing verification algorithms do not reach the theoretical efficiency upper bound.
Careful draft sampling design can improve decoding efficiency.
Abstract
Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts, and the target LLM verifies them in parallel, ensuring that the final output conforms to the target model distribution. The two main design choices in MDSD are the draft sampling method and the verification algorithm. For a fixed draft sampling method, the optimal acceptance rate is a solution to an optimal transport problem, but the complexity of this problem makes it difficult to solve for the optimal acceptance rate and measure the gap between existing verification algorithms and the theoretical upper bound. This paper discusses the dual of the optimal transport problem,…
Peer Reviews
Decision·ICLR 2025 Poster
The paper makes progress on understanding the acceptance rate of Multi-Draft Speculative Sampling. The authors show a clever transformation of the transportation problem formulation of optimal acceptance rates to a subset selection problem, and then show an algorithm to solve the subset selection if the draft distribution satisfies certain properties. They then propose a new greedy Multi-Draft Speculative Sampling algorithm, which is closer to the optimal acceptance rate on some datasets. The
The paper takes a bit of effort to read, partly because of a lot of notation, and partly because of results that may not be familiar to a lot f readers. I am not sure if the authors can do much about this.
It is interesting to see that two existing verification approaches (K-Seq and the widely used RRS) can be unified as solving the same optimal transport problem coresponding to sampling without replacement $p_{draft}$. Therefore, they share the same upper bound. Table 1 shows that for a variety of models and settings, the two methods are close enough to the optimal acceptance rates. The proposed "greedy" draft generation approach and verification method is an interesting combination of the greed
### significance Much portion of the paper is dedicated to theoretical derivations of the optimal acceptance rate. However, the description and the development of the proposed algorithm is underplayed. The proposed methods deserve a proper name, clear demonstration of the verification algorithm (is the algorithm practical for $n>2 as compared to SpecHub?) and more thorough theoretical and experimental investigations to demonstrate the pros and cons compared with previous algorithms. ### cla
Paper is mathematically rigorous; it is relatively easy to follow and grasp new concepts. Authors do a good job of highlighting the drawbacks of previous work and offer solutions.
Some claims are optimistic, such as "the upper bound has never been computed before", I personally refrain from making such certain statements. Some contributions are minor, as an example, deriving the dual of an LP is not a contribution, yet it is claimed to be in the first bullet point of contributions. Although paper is mathematically mature, it borrows a lot from previous publications, in other words, novel theoretical contribution is minor.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy
