Speculative Sampling with Reinforcement Learning
Chenan Wang, Daniel H. Shi, Haipeng Chen

TL;DR
This paper introduces Re-SpS, a reinforcement learning framework that dynamically optimizes speculative sampling hyperparameters in large language models, significantly improving inference speed without sacrificing output quality.
Contribution
Re-SpS is the first RL-based method for adaptive hyperparameter tuning in speculative sampling, enhancing efficiency across diverse contexts.
Findings
Achieves up to a 5.45× speedup over the backbone LLM.
Achieves up to a 1.12× speedup over the SOTA method EAGLE-3.
Maintains output fidelity across benchmarks.
Abstract
Inference-time latency has remained an open challenge for real-world applications of large language models (LLMs). State-of-the-art (SOTA) speculative sampling (SpS) methods for LLMs, like EAGLE-3, use tree-based drafting to explore multiple candidate continuations in parallel. However, the hyperparameters controlling the tree structure are static, which limits flexibility and efficiency across diverse contexts and domains. We introduce Reinforcement learning for Speculative Sampling (Re-SpS), the first reinforcement learning (RL)-based framework for draft tree hyperparameter optimization. Re-SpS dynamically adjusts draft tree hyperparameters in real time, learning context-aware policies that maximize generation speed by balancing speculative aggression with computational overhead. It leverages efficient state representations from target model hidden states and introduces multi-step…
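The trade-off the abstract describes, speculative aggression versus computational overhead, can be illustrated with a deliberately simplified model. The sketch below is not the Re-SpS method and does not model EAGLE-3's draft tree. It assumes a single chain of drafted tokens, an independent per-token acceptance probability `alpha`, and a draft step that costs a fraction `cost_ratio` of one target-model pass, which is the usual expected-speedup analysis for chain speculative decoding. All function and parameter names are hypothetical.

```python
# Toy model (assumption): chain drafting, i.i.d. acceptance, fixed relative draft cost.
# This is NOT the paper's RL policy; it only shows why the best draft depth
# depends on how often drafted tokens are accepted.

def expected_tokens(alpha: float, depth: int) -> float:
    """Expected tokens emitted per target-model pass when `depth` tokens are drafted.

    Uses the geometric-series result (1 - alpha^(depth+1)) / (1 - alpha).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


def speedup(alpha: float, depth: int, cost_ratio: float) -> float:
    """Expected speedup over plain autoregressive decoding.

    One iteration costs `depth` draft steps plus one target pass,
    measured in units of a target pass.
    """
    iteration_cost = depth * cost_ratio + 1.0
    return expected_tokens(alpha, depth) / iteration_cost


def best_depth(alpha: float, cost_ratio: float, max_depth: int) -> int:
    """Draft depth in [0, max_depth] that maximizes expected speedup."""
    return max(range(max_depth + 1), key=lambda d: speedup(alpha, d, cost_ratio))


if __name__ == "__main__":
    # Hypothetical acceptance rates standing in for "easy" and "hard" contexts.
    for alpha in (0.5, 0.7, 0.9):
        d = best_depth(alpha, cost_ratio=0.05, max_depth=10)
        print(f"alpha={alpha:.1f} best_depth={d} speedup={speedup(alpha, d, 0.05):.2f}")
```

Under these assumptions the best depth grows with the acceptance rate, so a single static setting is too aggressive for hard contexts and too conservative for easy ones. That gap is what a context-aware policy would aim to close.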
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques
