Towards Optimal Multi-draft Speculative Decoding

Zhengmian Hu; Tong Zheng; Vignesh Viswanathan; Ziyi Chen; Ryan A. Rossi; Yihan Wu; Dinesh Manocha; Heng Huang

arXiv:2502.18779·cs.CL·February 27, 2025

Open Access · 3 Reviews

TL;DR

This paper analyzes Multi-Draft Speculative Decoding for large language models, providing a theoretical framework to optimize acceptance rates and comparing different sampling methods to improve decoding efficiency.

Contribution

It introduces an efficient way to compute the optimal acceptance rate using the dual of an optimal transport problem and measures the efficiency gap of existing algorithms.
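For orientation, the primal problem whose dual the paper studies can be written roughly as below. This is a sketch in assumed notation, not the paper's exact statement: $\Sigma$ is the vocabulary, $p$ the target distribution, $p_{\text{draft}}$ the joint distribution of the $n$ drafts, and $\pi$ a coupling between the drafts and the output token $y$. The last line is the well-known single-draft special case with draft distribution $q$.

```latex
\begin{aligned}
\alpha^*(p, p_{\text{draft}})
  &= \max_{\pi \ge 0} \;
     \sum_{\bar{x} \in \Sigma^n} \;
     \sum_{y \in \{\bar{x}_1, \dots, \bar{x}_n\}} \pi(\bar{x}, y) \\
\text{s.t.}\quad
  & \sum_{y \in \Sigma} \pi(\bar{x}, y) = p_{\text{draft}}(\bar{x})
    \quad \forall\, \bar{x} \in \Sigma^n, \\
  & \sum_{\bar{x} \in \Sigma^n} \pi(\bar{x}, y) = p(y)
    \quad \forall\, y \in \Sigma. \\[4pt]
n = 1:\quad
  \alpha^*(p, q) &= \sum_{x \in \Sigma} \min\bigl(p(x), q(x)\bigr)
\end{aligned}
```

The objective counts the probability mass on which the output token coincides with one of the drafts. The linear program has on the order of $|\Sigma|^{n+1}$ variables, which is why solving it directly is impractical for realistic vocabularies.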

Findings

01

Sampling without replacement outperforms sampling with replacement.

02

Existing verification algorithms do not reach the theoretical efficiency upper bound.

03

Careful draft sampling design can improve decoding efficiency.
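The first finding can be illustrated on a toy example. The sketch below computes the exact acceptance rate of recursive rejection sampling (RRS) when drafts are drawn with and without replacement. It assumes the commonly described form of RRS: accept draft x with probability min(1, p(x)/q(x)), and on rejection continue with the normalized residual max(p - q, 0). The function names and the three-token distributions are hypothetical, and one example does not establish the general result.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def acceptance_rate(p, q, n_drafts, without_replacement):
    """Exact probability that one of n_drafts drafts is accepted under RRS.

    p: target distribution, q: draft distribution for the next draft.
    Enumerates rejected tokens, so it is only practical for tiny vocabularies.
    """
    if n_drafts == 0:
        return 0.0
    vocab = range(len(p))
    # Token x is drawn and accepted with mass min(p[x], q[x]).
    rate = sum(min(p[x], q[x]) for x in vocab)
    if n_drafts == 1 or rate >= 1.0 - 1e-12:
        return rate
    # After a rejection the target becomes the normalized residual.
    residual = normalize([max(p[x] - q[x], 0.0) for x in vocab])
    for x in vocab:
        reject_mass = max(q[x] - p[x], 0.0)  # P(draw x and reject it)
        if reject_mass == 0.0:
            continue
        if without_replacement:
            # The rejected token cannot be drafted again.
            next_q = [0.0 if y == x else q[y] for y in vocab]
            if sum(next_q) <= 0.0:
                continue  # no tokens left to draft
            next_q = normalize(next_q)
        else:
            next_q = q
        rate += reject_mass * acceptance_rate(
            residual, next_q, n_drafts - 1, without_replacement
        )
    return rate


p = [0.5, 0.3, 0.2]  # toy target distribution (assumption)
q = [0.2, 0.3, 0.5]  # toy draft distribution (assumption)
with_repl = acceptance_rate(p, q, 2, without_replacement=False)
without_repl = acceptance_rate(p, q, 2, without_replacement=True)
print(with_repl, without_repl)
```

In this example the second draft helps more without replacement, because the rejected token's mass is redistributed to tokens the residual target still needs.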

Abstract

Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts, and the target LLM verifies them in parallel, ensuring that the final output conforms to the target model distribution. The two main design choices in MDSD are the draft sampling method and the verification algorithm. For a fixed draft sampling method, the optimal acceptance rate is a solution to an optimal transport problem, but the complexity of this problem makes it difficult to solve for the optimal acceptance rate and measure the gap between existing verification algorithms and the theoretical upper bound. This paper discusses the dual of the optimal transport problem,…
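The requirement that the final output conform to the target distribution can be checked exactly on a small example. The sketch below assumes recursive rejection sampling with i.i.d. drafts (sampling with replacement) as the verification algorithm; this is one existing scheme, not the paper's proposed method, and the function name and toy distributions are hypothetical.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def rrs_output_distribution(p, q, n_drafts):
    """Exact output distribution and acceptance rate of recursive rejection
    sampling with n_drafts i.i.d. drafts from q and target p."""
    vocab = range(len(p))
    output = [0.0] * len(p)
    residual = list(p)  # target that the current draft is verified against
    reach = 1.0         # probability that all earlier drafts were rejected
    for _ in range(n_drafts):
        # A draft x ~ q is accepted with prob min(1, residual[x] / q[x]),
        # so x is emitted at this step with mass min(residual[x], q[x]).
        accepted = [min(residual[x], q[x]) for x in vocab]
        step_rate = sum(accepted)
        for x in vocab:
            output[x] += reach * accepted[x]
        reach *= 1.0 - step_rate
        if reach <= 1e-15:
            break  # every path has accepted a draft already
        residual = normalize([max(residual[x] - q[x], 0.0) for x in vocab])
    # If every draft was rejected, sample from the final residual.
    for x in vocab:
        output[x] += reach * residual[x]
    return output, 1.0 - reach


p = [0.5, 0.3, 0.2]  # toy target distribution (assumption)
q = [0.2, 0.3, 0.5]  # toy draft distribution (assumption)
output, rate = rrs_output_distribution(p, q, n_drafts=2)
print(output, rate)
```

The output distribution matches p regardless of how many drafts are used; only the acceptance rate changes, which is the quantity the optimal transport formulation bounds from above.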

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The paper makes progress on understanding the acceptance rate of Multi-Draft Speculative Sampling. The authors show a clever transformation of the transportation problem formulation of optimal acceptance rates to a subset selection problem, and then show an algorithm to solve the subset selection if the draft distribution satisfies certain properties. They then propose a new greedy Multi-Draft Speculative Sampling algorithm, which is closer to the optimal acceptance rate on some datasets. The

Weaknesses

The paper takes a bit of effort to read, partly because of the heavy notation and partly because of results that may not be familiar to many readers. I am not sure if the authors can do much about this.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

It is interesting to see that two existing verification approaches (K-Seq and the widely used RRS) can be unified as solving the same optimal transport problem corresponding to sampling without replacement $p_{draft}$. Therefore, they share the same upper bound. Table 1 shows that for a variety of models and settings, the two methods are close enough to the optimal acceptance rates. The proposed "greedy" draft generation approach and verification method is an interesting combination of the greed

Weaknesses

### significance A large portion of the paper is dedicated to theoretical derivations of the optimal acceptance rate. However, the description and development of the proposed algorithm are underplayed. The proposed methods deserve a proper name, a clear demonstration of the verification algorithm (is the algorithm practical for $n>2$ as compared to SpecHub?), and more thorough theoretical and experimental investigations to demonstrate the pros and cons compared with previous algorithms. ### cla

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The paper is mathematically rigorous; it is relatively easy to follow and grasp new concepts. The authors do a good job of highlighting the drawbacks of previous work and offer solutions.

Weaknesses

Some claims are optimistic, such as "the upper bound has never been computed before"; I personally refrain from making such certain statements. Some contributions are minor; for example, deriving the dual of an LP is not a contribution, yet it is claimed as one in the first bullet point of contributions. Although the paper is mathematically mature, it borrows a lot from previous publications; in other words, the novel theoretical contribution is minor.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy