BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang

TL;DR
BanditSpec is a training-free method that adaptively tunes the hyperparameters of speculative decoding in large language models, using bandit algorithms to choose a configuration dynamically during inference.
Contribution
The paper formulates hyperparameter selection for speculative decoding as a multi-armed bandit problem and develops two algorithms, UCBSpec and EXP3Spec, with theoretical regret bounds, improving decoding efficiency without additional training.
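To make the bandit formulation concrete, here is a minimal sketch of the textbook UCB1 rule choosing among candidate decoding configurations. It is not the paper's UCBSpec: the function names `ucb1_select` and `run_ucb1`, the three acceptance probabilities, and the Bernoulli reward (1 if a drafted token is accepted, 0 otherwise) are all hypothetical stand-ins chosen for illustration.

```python
import math
import random


def ucb1_select(counts, means, t):
    """Pick an arm by the standard UCB1 rule; try every arm once first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb1(accept_probs, horizon, seed=0):
    """Simulate configuration selection with hypothetical Bernoulli rewards."""
    rng = random.Random(seed)
    k = len(accept_probs)
    counts = [0] * k    # number of times each configuration was used
    means = [0.0] * k   # running mean reward of each configuration
    for t in range(1, horizon + 1):
        arm = ucb1_select(counts, means, t)
        # Assumed reward model: 1.0 if the draft is accepted, else 0.0.
        reward = 1.0 if rng.random() < accept_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


# Three hypothetical configurations with made-up acceptance probabilities.
counts, means = run_ucb1([0.2, 0.4, 0.9], horizon=5000)
print(counts, [round(m, 3) for m in means])
```

In this toy setting the selection rule concentrates its pulls on the configuration with the highest acceptance probability while still sampling the others occasionally. A real decoder would replace the simulated reward with a measured quantity such as the number of accepted tokens per step.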
Findings
Effective hyperparameter adaptation with bandit algorithms
Theoretical regret bounds established for proposed methods
Empirical results show near-oracle performance in LLM inference
Abstract
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under…
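For readers unfamiliar with the second family of algorithms named in the abstract, the sketch below shows the classical EXP3 update, which is designed for adversarially chosen rewards. It illustrates the generic algorithm only and should not be read as the paper's EXP3Spec: `exp3_probs`, `run_exp3`, the exploration rate `gamma`, and the fixed per-configuration rewards are assumptions made for this example.

```python
import math
import random


def exp3_probs(weights, gamma):
    """Mix the normalized weights with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def run_exp3(reward_fn, k, horizon, gamma=0.1, seed=0):
    """Run EXP3 for `horizon` rounds; reward_fn(t, arm) must return a value in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * k
    pulls = [0] * k
    for t in range(horizon):
        probs = exp3_probs(weights, gamma)
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)
        pulls[arm] += 1
        # Importance-weighted estimate keeps the reward estimate unbiased.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / k)
        # Rescale so the largest weight is 1.0; this avoids overflow
        # and leaves the sampling probabilities unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return pulls, exp3_probs(weights, gamma)


# Hypothetical fixed rewards per configuration, e.g. a normalized speedup.
toy_rewards = [0.2, 0.5, 0.9]
pulls, final_probs = run_exp3(lambda t, arm: toy_rewards[arm], k=3, horizon=5000)
print(pulls, [round(p, 3) for p in final_probs])
```

Unlike a UCB-style rule, EXP3 samples a configuration from a probability distribution at every round, so each configuration keeps a selection probability of at least `gamma / k` no matter how poorly it has performed so far.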
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques
Methods: ADaptive gradient method with the OPTimal convergence rate · ALIGN
