Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits

Ashish Khisti; M.Reza Ebrahimi; Hassan Dbouk; Arash Behboodi; Roland Memisevic; Christos Louizos

arXiv:2410.18234·cs.CL·May 12, 2025


Open Access · 3 Reviews

TL;DR

This paper introduces a theoretical framework for multi-draft speculative sampling: the optimal sampling scheme decomposes into an importance sampling step followed by a single-draft speculative sampling step, with proven theoretical limits and improved block efficiency in experiments.

Contribution

It presents a novel decomposition of the optimal multi-draft sampling scheme and provides theoretical conditions and explicit formulas for acceptance probabilities.

Findings

01

Decomposition into importance sampling and single-draft steps improves sampling efficiency.

02

Explicit formulas for the acceptance probability in the case of two identical draft models.

03

Experimental results show increased block efficiency and token rates.

Abstract

We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step an importance sampling (IS) type scheme is used to select one intermediate token; in the second step (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target…
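The two-step structure described in the abstract can be illustrated with a minimal sketch. It assumes a tiny vocabulary, K i.i.d. drafts from a single draft distribution `p`, and a target distribution `q`. The selection weights `q(x)/p(x)` in step 1 are an illustrative importance-weighting choice, not the paper's optimal IS scheme, and all function names are hypothetical. Step 2 is standard single-draft speculative sampling, applied against the exact distribution `r` of the intermediate token, which is computed here by brute-force enumeration.

```python
import random
from itertools import product


def intermediate_dist(p, q, k=2):
    """Exact distribution r of the token selected in step 1 (by enumeration)."""
    n = len(p)
    r = [0.0] * n
    for drafts in product(range(n), repeat=k):
        prob = 1.0
        for x in drafts:
            prob *= p[x]
        if prob == 0.0:
            continue  # this list of drafts can never be sampled
        # Assumed weighting q(x)/p(x); the paper's IS scheme may differ.
        weights = [q[x] / p[x] for x in drafts]
        total = sum(weights)
        for x, w in zip(drafts, weights):
            r[x] += prob * (w / total if total > 0 else 1.0 / k)
    return r


def acceptance_prob(r, q):
    """Probability that step 2 accepts the intermediate token."""
    return sum(min(rx, qx) for rx, qx in zip(r, q))


def output_dist(p, q, k=2):
    """Exact distribution of the final output token (should equal q)."""
    r = intermediate_dist(p, q, k)
    resid = [max(qx - rx, 0.0) for qx, rx in zip(q, r)]
    z = sum(resid)
    reject = 1.0 - acceptance_prob(r, q)
    return [min(rx, qx) + (reject * s / z if z > 0 else 0.0)
            for rx, qx, s in zip(r, q, resid)]


def sample_token(p, q, k=2, rng=random):
    """Draw one output token using the two-step sketch."""
    n = len(p)
    drafts = rng.choices(range(n), weights=p, k=k)
    # Step 1: pick one intermediate token among the drafts.
    weights = [q[x] / p[x] for x in drafts]
    if sum(weights) > 0:
        y = rng.choices(drafts, weights=weights)[0]
    else:
        y = rng.choice(drafts)
    # Step 2: single-draft speculative sampling of y against the target q.
    r = intermediate_dist(p, q, k)
    if rng.random() < min(1.0, q[y] / r[y]):
        return y
    resid = [max(qx - rx, 0.0) for qx, rx in zip(q, r)]
    return rng.choices(range(n), weights=resid)[0]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy draft distribution (assumption)
    q = [0.2, 0.3, 0.5]  # toy target distribution (assumption)
    print("output distribution:", output_dist(p, q))
    print("acceptance probability:", acceptance_prob(intermediate_dist(p, q), q))
```

Because step 2 corrects against the exact distribution of the intermediate token, the output distribution matches the target for any valid step-1 selection rule; the choice of rule only affects how often the intermediate token is accepted. The enumeration costs O(n^K) and is meant for checking the idea on toy distributions, not for real vocabularies.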

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The results in the paper are quite original and significant, to my knowledge, which is a bit limited. The experiments show that the proposed algorithm works well in practice.

Weaknesses

I would have liked the presentation in Section 4 to be a little clearer. Please write out the full linear program, and then explain the truncation a bit more: what is the truncated program computing? In the experiment section, please include results for Temperature < 1, which is much more common in practice.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The paper is well written and easy to follow. Theorems 2 and 3 are novel and the experimental results are convincing.

Weaknesses

See questions section below.

Reviewer 03 · Rating 8 · Confidence 2

Strengths

This paper provides a strong theoretical contribution to a very important topic. Speculative decoding serves as a refreshing and novel direction for accelerating LLM inference. This work puts the multi-draft model selection problem on better theoretical footing. The results seem to be sound from my reading and no obvious errors were found.

Weaknesses

By and large, the biggest complaint I have with this draft is the background exposition. Prior to reading this I was not familiar with speculative decoding. While I understand space is tight, many key concepts, such as the role and form of the acceptance probability in the speculative decoder, were not clearly explained. This is a *key* aspect of the work, and the draft would strongly benefit from giving it more treatment. To be honest, I couldn't understand anything the first time I read the paper and


Taxonomy

Topics: Data Analysis with R · Statistical and Computational Modeling · Time Series Analysis and Forecasting