Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

TL;DR
This paper introduces Learning to Draft (LTD), a reinforcement learning-based method that dynamically optimizes speculative decoding for large language models, significantly improving inference speed over existing static and proxy-based approaches.
Contribution
LTD is the first approach to directly optimize decoding throughput in speculative decoding using reinforcement learning with co-adaptive policies.
Findings
Achieves speedup ratios from 2.24x to 4.32x across models and tasks.
Outperforms the state-of-the-art Eagle3 by up to 36.4%.
Demonstrates effective adaptation of draft and verification policies.
Abstract
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt…
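To make the throughput objective concrete, the sketch below computes accepted tokens per unit time for a single draft-and-verify cycle. It is a minimal illustration, not the authors' implementation: the `CycleOutcome` type, the `cycle_throughput` function, and all numeric values are hypothetical, and it assumes the total cycle time is simply drafting time plus verification time.

```python
from dataclasses import dataclass


@dataclass
class CycleOutcome:
    """Result of one hypothetical draft-and-verify cycle."""
    accepted_tokens: int   # tokens accepted by the target model in this cycle
    draft_time_s: float    # wall-clock seconds spent drafting candidates
    verify_time_s: float   # wall-clock seconds spent verifying them


def cycle_throughput(outcome: CycleOutcome) -> float:
    """Accepted tokens per second for one cycle.

    Assumption: total time = draft time + verify time.
    """
    total_time = outcome.draft_time_s + outcome.verify_time_s
    if total_time <= 0:
        raise ValueError("total cycle time must be positive")
    return outcome.accepted_tokens / total_time


# Illustrative numbers only; these are NOT measurements from the paper.
shallow = CycleOutcome(accepted_tokens=3, draft_time_s=0.010, verify_time_s=0.020)
deep = CycleOutcome(accepted_tokens=4, draft_time_s=0.030, verify_time_s=0.020)

# The deeper draft accepts more tokens per cycle, but the extra drafting
# time lowers its tokens-per-second figure below the shallow draft's.
print(f"shallow: {cycle_throughput(shallow):.1f} tokens/s")
print(f"deep:    {cycle_throughput(deep):.1f} tokens/s")
```

With these made-up numbers, the configuration with the longer acceptance length has the lower throughput, which is the gap between a proxy metric and the true time cost that the abstract describes.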
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed method LTD directly optimizes the practical system throughput ($L_A / T_{total}$), i.e., the number of accepted tokens divided by the total time. This ensures the strategy is genuinely maximizing inference speed while prior works mostly focused on indirect metrics like acceptance length.
2. The method uses a Reinforcement Learning (RL) environment to train two dynamic, co-adaptive policies: one for the draft tree's depth and one for the verification size. This dynamic approach
1. Training the adaptive policies requires setting up and running a complex RL environment. This adds significant computational overhead and complexity during the training phase, which is a major barrier to adoption compared to simpler, non-adaptive speculative decoding methods.
2. The core of LTD relies on accurately modeling the time cost of both the drafting and verification phases to compute the throughput objective ($T_{total}$). If the real-world environment introduces variances or non-li
- The paper formalizes speculative decoding as a reinforcement learning (RL) environment that directly optimizes throughput rather than proxy metrics such as acceptance length. It introduces two interacting policies (Depth and Size), and the co-adaptive training framework is conceptually clear and empirically validated.
- The algorithmic structure is clearly explained, and the figures effectively illustrate the RL formulation and the draft–verify cycle.
- The proposed method demonstrates robustn
- The RL formulation lacks theoretical guarantees or convergence analysis under the throughput-based reward.
- The paper does not report quantitative metrics related to training efficiency (e.g., sample efficiency or convergence speed).
- The paper does not discuss the rationale behind selecting PPO as the optimization algorithm for policy learning, nor whether alternative RL methods were considered.
- While the ablation studies demonstrate co-adaptation effects, the analysis remains qualitativ
- The paper introduces a principled reinforcement learning framework that directly optimizes throughput rather than proxy metrics like acceptance length, addressing a key limitation in prior speculative decoding methods.
- Extensive experiments across multiple LLMs and tasks demonstrate consistent and significant speedups (up to 36.4%) over strong baselines such as Eagle3. The analyses are also comprehensive and well-designed.
- The method generalizes well to high-temperature decoding scenarios
- The RL-based policy training (on HumanEval for 100K and 1M PPO steps) is expensive and tuned on a specific dataset. I am concerned about the training cost and the transferability to unseen domains or longer-context tasks. Could you provide the training details, including the actual GPU hours, and explain why training requires so many steps?
- I am also concerned about the iterative optimization part. From Tables 7 and 8, we can see that some models/datasets benefit from iterative optimization, wh
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
