Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
Jie Jiang, Xing Sun, Ruotian Chen, Jianan Su, Kaixin Shen

TL;DR
This paper introduces PPOW, a reinforcement learning framework that optimizes speculative-decoding drafters at the window level rather than the token level, reporting average acceptance lengths of 6.29-6.52 tokens and speedups of 3.39-4.36x across multiple large language models.
Contribution
PPOW uses reinforcement learning to shift drafter optimization from token-level imitation to window-level optimization, combining a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing.
Findings
PPOW achieves average acceptance lengths of 6.29-6.52 tokens.
It realizes speedups of 3.39-4.36x across multiple models.
The results demonstrate that practical window-level optimization improves decoding efficiency.
Abstract
Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes…
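The prefix sensitivity described in the abstract can be illustrated with a minimal sketch. It assumes greedy, exact-match verification, which is a simplification and not necessarily the verification rule used in the paper; the function name `accepted_prefix_length` and the token IDs are hypothetical. The point is only that a draft window can agree with the target at most positions and still yield a short accepted prefix.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count draft tokens accepted before the first mismatch.

    Assumption: greedy exact-match verification, where the target model's
    own token at each position is compared against the drafted token.
    """
    accepted = 0
    for drafted, verified in zip(draft, target):
        if drafted != verified:
            break  # an early mismatch invalidates the rest of the window
        accepted += 1
    return accepted


# Hypothetical token IDs: the draft is wrong only at position 2.
draft_window = [11, 42, 7, 99, 5, 3]
target_tokens = [11, 42, 8, 99, 5, 3]

# Token-level view: number of positions where draft and target agree.
token_matches = sum(d == t for d, t in zip(draft_window, target_tokens))

# Window-level view: length of the accepted prefix.
accepted = accepted_prefix_length(draft_window, target_tokens)

print(token_matches, accepted)  # prints "5 2"
```

Here five of six drafted tokens match the target, yet only two are accepted, because the single mismatch occurs early in the window. A token-level objective scores this draft highly, while a window-level measure of utility does not.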
