SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang, Xudong Guo, Mengdi Wang

TL;DR
SpecDec++ enhances speculative decoding by adaptively selecting candidate lengths using a trained acceptance predictor, leading to significant speedups in large language model inference.
Contribution
It introduces an adaptive candidate length method for speculative decoding based on a theoretical threshold policy, improving inference speed over previous heuristics.
Findings
Achieves over 2x speedup on multiple datasets.
Outperforms baseline speculative decoding by 9-11%.
Validates the theoretical threshold policy for candidate selection.
Abstract
Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment…
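The threshold rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `choose_candidate_length`, the `max_len` cap, and the assumption that per-token acceptance probabilities are available and can be multiplied as if independent are all hypothetical choices made for the example.

```python
from typing import Iterable


def choose_candidate_length(
    accept_probs: Iterable[float],  # hypothetical: predicted acceptance probability per draft token
    threshold: float,               # stop once P(at least one rejection) exceeds this value
    max_len: int,                   # hypothetical hard cap on the candidate length
) -> int:
    """Return how many draft tokens to propose before asking the target model to verify."""
    p_all_accepted = 1.0
    k = 0
    for p in accept_probs:
        if k >= max_len:
            break
        # Assumption: acceptances are treated as independent, so the
        # probability that every token so far is accepted is a running product.
        p_all_accepted *= p
        k += 1
        # Threshold policy: stop and verify once a rejection becomes likely enough.
        if 1.0 - p_all_accepted > threshold:
            break
    return k


if __name__ == "__main__":
    probs = [0.9, 0.9, 0.9, 0.9, 0.9]
    print(choose_candidate_length(probs, threshold=0.3, max_len=10))
    print(choose_candidate_length(probs, threshold=0.15, max_len=10))
```

With a lower threshold the sketch stops earlier, so fewer tokens are drafted and potentially discarded; with a higher threshold it drafts longer runs, which trades more discarded tokens for fewer target-model forward passes.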
Peer Reviews
Decision: Submitted to ICLR 2025
**Theoretical Soundness:** The paper is theoretically grounded, providing a solid foundation for the proposed approach. **Ease of Implementation:** The method is straightforward to implement, which adds to its practical appeal and accessibility.
**Limited Practical Improvement:** The reported improvement over vanilla speculative decoding appears minimal, with only a 10% speedup for a large model. Moreover, vanilla speculative decoding is significantly slower than state-of-the-art methods like Medusa [1] and Eagle [2], raising questions about the method's practicality, especially given the additional complexity of training a candidate length prediction model. In terms of theoretical analysis, while the optimal strategy is presented as a
The paper addresses an existing problem in draft token generation with a learning-based method, an interesting alternative to non-learning-based heuristics or a hyper-parameter search over a fixed draft length. The motivation for an adaptive draft length is explained well and illustrated with an ideal-case example, with proper definitions of discard rate and verification rate. The training of the additional draft head, which predicts whether draft token generation should stop or continue, is well explained.
Though the paper solves an important problem with a learning-based approach, the overall gains seem small (7%-11%) over vanilla speculative decoding. The paper does not compare with existing heuristic-based adaptive draft length methods, which use either draft entropy or other confidence scores to decide whether to stop or continue draft generation. The hyper-parameter search required for finding w_{reject} and the probability threshold is an additional hassle also present in vanilla SpD, in th
1. Speculative decoding is of significant importance for improving the inference efficiency of large language models. 2. The paper provides sufficient justification in theoretical analysis by modeling the problem as an MDP and proposing a threshold policy, which offers a solid theoretical foundation for dynamically adjusting the candidate length. 3. By reducing the number of forward passes of the target model and decreasing the number of discarded tokens, SpecDec++ achieves faster inference speed
1. The paper aims to accelerate inference by reducing the number of forward passes of the target model and decreasing the number of discarded tokens, but the experiments show only a modest improvement over the baseline. Moreover, the training cost for the head is substantial. I wonder if these two aspects are not the key bottlenecks of speculative sampling, or if this method is not very effective in addressing these issues. This makes me somewhat skeptical about the effectiveness of this approach
Taxonomy
Topics: Algorithms and Data Compression
