TL;DR
This paper introduces a provably no-regret online model selection algorithm for speculative decoding in large language models, improving efficiency and accuracy over existing bandit-based methods across various datasets.
Contribution
The paper presents a novel algorithm that evaluates all draft models, not only the selected one, without extra queries to the target model, outperforming bandit-based approaches and reducing computational overhead.
Findings
The method outperforms the EAGLE3 and BanditSpec baselines on diverse datasets.
The approach applies to single-draft, multi-draft, and draft-tree speculative decoding methods.
Experimental results show significant improvements in long reasoning tasks.
Abstract
Speculative decoding is widely used to accelerate large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the best draft model in hindsight for each query, in terms of either the token acceptance probability or the expected acceptance length. In particular, we show that we can accurately evaluate all draft models, rather than only the chosen one, without incurring additional queries to the target model, which allows us to improve exponentially over the existing bandit-based approach as the number of draft models increases. Our approach is generically applicable to any speculative decoding method (single-draft, multi-draft, and draft-tree). Moreover, we design system-efficient versions of online learners and demonstrate that the overhead in computation and latency…
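To make the full-information idea in the abstract concrete, here is a minimal sketch, not the paper's algorithm. It assumes the standard speculative sampling rule, under which a draft token is accepted with probability sum_x min(p(x), q(x)) for target distribution p and draft distribution q. Because p is already produced by the verification step, that score can be computed for every draft model, not only the one that was run. The selector below is plain Hedge (exponential weights); the names `acceptance_prob`, `HedgeSelector`, and `demo`, the learning rate, and the fixed toy distributions are all hypothetical choices for illustration.

```python
import math
import random


def acceptance_prob(p, q):
    """Token acceptance probability under standard speculative sampling:
    sum over the vocabulary of min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


class HedgeSelector:
    """Exponential-weights learner over draft models (illustrative only)."""

    def __init__(self, n_models, eta=0.5):
        self.eta = eta                    # learning rate (assumed value)
        self.log_w = [0.0] * n_models     # log-weights for numerical stability

    def probs(self):
        m = max(self.log_w)
        w = [math.exp(lw - m) for lw in self.log_w]
        total = sum(w)
        return [x / total for x in w]

    def choose(self, rng):
        # Sample the draft model to run for the next step.
        return rng.choices(range(len(self.log_w)), weights=self.probs())[0]

    def update(self, rewards):
        # Full-information update: every model receives its own reward,
        # including the models that were not chosen.
        for i, r in enumerate(rewards):
            self.log_w[i] += self.eta * r


def demo(steps=200, seed=0):
    rng = random.Random(seed)
    # Toy 3-token vocabulary; a real system would see a new target
    # distribution at every decoding position.
    target = [0.7, 0.2, 0.1]
    drafts = [
        [0.1, 0.2, 0.7],     # acceptance probability 0.4
        [0.6, 0.3, 0.1],     # acceptance probability 0.9
        [0.34, 0.33, 0.33],  # acceptance probability 0.64
    ]
    selector = HedgeSelector(len(drafts))
    for _ in range(steps):
        selector.choose(rng)  # the draft that would actually be run
        selector.update([acceptance_prob(target, q) for q in drafts])
    return selector.probs()


if __name__ == "__main__":
    print(demo())
```

In a bandit setting only the chosen model's reward would be observed, so the learner would have to spend steps exploring each draft model; here every model is scored at every step, which is the contrast the abstract draws as the number of draft models grows.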
