Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization
Rahul Krishna Thomas, Arka Pal

TL;DR
This paper introduces a convex optimization approach to multi-draft speculative sampling for large language models, achieving high acceptance rates and low latency by efficiently solving the optimal transport problem.
Contribution
The paper reformulates the optimal transport linear program as a convex optimization problem using polymatroid theory, enabling practical multi-draft sampling with high acceptance and efficiency.
Findings
Achieves 90% acceptance rate in multi-draft sampling.
Reduces overhead to under 100 ms per token.
Provides a scalable algorithm for optimal n-draft sampling.
Abstract
Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step n draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over V^n variables, with V being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are…
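To make the accept-or-resample step in the abstract concrete, here is a minimal sketch of the standard single-draft verification rule: accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p - q, 0). This is the well-known single-draft baseline, not the paper's multi-draft method; the names `speculative_step` and `acceptance_probability` and the toy distributions are assumptions made for illustration.

```python
import random


def acceptance_probability(p, q):
    """Chance that the draft token is accepted: the sum over tokens of min(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def speculative_step(p, q, rng):
    """One step of single-draft speculative sampling (illustrative sketch).

    p: target-model probabilities over the vocabulary
    q: draft-model probabilities over the same vocabulary
    Returns (token, accepted).
    """
    vocab = range(len(p))
    # The cheap draft model proposes a candidate token.
    x = rng.choices(vocab, weights=q, k=1)[0]
    # Verification: accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(p - q, 0);
    # rng.choices normalizes the weights internally.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0], False


if __name__ == "__main__":
    # Toy 3-token vocabulary (hypothetical numbers).
    p = [0.5, 0.3, 0.2]
    q = [0.2, 0.3, 0.5]
    rng = random.Random(0)
    print(acceptance_probability(p, q))
    print(speculative_step(p, q, rng))
```

The emitted token is distributed exactly according to p, which is why inference quality is preserved. The multi-draft setting in the abstract generalizes this rule to n candidates per step, where choosing the verification distribution that maximizes acceptance becomes the OTLP.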
Taxonomy
Topics: Machine Learning and Algorithms · Complexity and Algorithms in Graphs · Adversarial Robustness in Machine Learning
