Out-of-Vocabulary Sampling Boosts Speculative Decoding
Nadav Timor, Jonathan Mamou, Oren Pereg, Hongyang Zhang, David Harel

TL;DR
This paper introduces Redistributing Drafter Kernels (RDK), a novel out-of-vocabulary sampler that improves speculative decoding efficiency by enabling highly pruned vocabularies with higher acceptance rates and reduced computational complexity.
Contribution
The paper presents RDK, the first out-of-vocabulary sampler that effectively restores acceptance rates and reduces redistribution time complexity, enabling highly pruned language model drafters.
Findings
RDK can achieve higher acceptance rates than vanilla and state-of-the-art samplers, a claim the paper proves mathematically.
RDK reduces redistribution complexity from O(N^2) to O(N).
RDK enables extremely pruned drafter vocabularies while maintaining efficiency.
Abstract
Speculative decoding relies on fast and accurate drafters. Recent state-of-the-art language models employ larger and larger vocabularies, which significantly slows down drafters. One promising approach to boost the efficiency of speculative decoding is to use drafters with smaller vocabularies. However, existing sampling methods cannot draw out-of-vocabulary tokens, creating a tradeoff between drafters' vocabulary size and acceptance rates. This paper introduces Redistributing Drafter Kernels (RDK), the first out-of-vocabulary sampler that effectively recovers acceptance rates by virtually restoring pruned target tokens. RDK leverages token-affinity priors to reallocate drafter mass towards high-overlap regions. We prove mathematically that RDK can achieve higher acceptance rates than vanilla and state-of-the-art samplers. We provide an efficient first-order approximation of RDK and…
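The tradeoff described in the abstract can be illustrated with a toy calculation. The sketch below assumes the standard speculative-sampling result that a drafted token is accepted with probability sum_x min(p(x), q(x)), where p is the target distribution and q the drafter distribution. The `redistribute` step is only a hypothetical stand-in for the idea of moving drafter mass onto pruned tokens: the function names, the four-token vocabulary, and the hand-picked row-stochastic `kernel` are all assumptions for illustration, not the paper's construction of RDK.

```python
import numpy as np

def acceptance_rate(p, q):
    """Per-token acceptance probability under speculative sampling: sum_x min(p(x), q(x))."""
    return float(np.minimum(p, q).sum())

def prune_drafter(q_full, keep):
    """Restrict a drafter distribution to the token ids in `keep` and renormalize."""
    q = np.zeros_like(q_full)
    q[keep] = q_full[keep]
    return q / q.sum()

def redistribute(q, kernel):
    """Hypothetical redistribution: push drafter mass through a row-stochastic matrix.

    Row i of `kernel` says how the mass on token i is spread over the full
    vocabulary, so out-of-vocabulary tokens can receive nonzero probability.
    """
    return q @ kernel

# Toy target distribution over a 4-token vocabulary (made-up numbers).
p = np.array([0.4, 0.3, 0.2, 0.1])

# A drafter that matched the target exactly before pruning, then lost tokens 2 and 3.
q_pruned = prune_drafter(p.copy(), keep=[0, 1])

# Hand-picked kernel (an assumption): token 0 leaks mass to token 2, token 1 to token 3.
kernel = np.array([
    [0.7, 0.0, 0.3, 0.0],
    [0.0, 0.7, 0.0, 0.3],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
q_redistributed = redistribute(q_pruned, kernel)

print("pruned drafter acceptance:       ", acceptance_rate(p, q_pruned))
print("redistributed drafter acceptance:", acceptance_rate(p, q_redistributed))
```

In this toy setting the pruned drafter can never propose tokens 2 or 3, so its acceptance rate is capped by the target mass on the kept tokens. Spreading some drafter mass onto the pruned tokens raises the overlap with the target. How RDK chooses that redistribution from token-affinity priors is specified in the paper, not here.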
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms
Methods: Pruning
