When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

Junxiong Wang; Fengxiang Bie; Jisen Li; Zhongzhu Zhou; Zelei Shao; Yubo Wang; Yinghui Liu; Qingyang Wu; Avner May; Sri Yanamandra; Ce Zhang; Tri Dao; Percy Liang; Ben Athiwaratkun; Shuaiwen Leon Song; Chenfeng Xu; Xiaoxia Wu

arXiv:2602.06932·cs.LG·May 18, 2026

TL;DR

Aurora is a unified training-serving system that continuously trains and adapts the speculator (draft model) used for speculative decoding in large language models, learning directly from live inference data. A speculator can therefore be deployed immediately and improved while it serves traffic.

Contribution

The paper proposes Aurora, a reinforcement-learning-based framework that adapts the speculator in real time during serving, which reduces deployment lag and limits degradation under domain shift.

Findings

01. Achieves 1.5x day-0 speedup on frontier models.
02. Delivers an additional 1.25x speedup during distribution shifts.
03. Supports hot-swapped speculator updates without service interruption.
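
The hot-swap finding can be illustrated with a minimal sketch. This is not Aurora's implementation: the class `SpeculatorSlot`, its methods, and the idea of modeling a speculator as a plain Python callable are all hypothetical, chosen only to show how a serving loop might pick up a new speculator between requests while in-flight requests keep using the one they started with.

```python
import threading


class SpeculatorSlot:
    """Hypothetical holder for the active speculator (illustration only)."""

    def __init__(self, speculator):
        self._lock = threading.Lock()
        # Store (version, speculator) as one tuple so readers always
        # see a matching pair.
        self._current = (0, speculator)

    def get(self):
        """Return a (version, speculator) snapshot for one request."""
        with self._lock:
            return self._current

    def swap(self, new_speculator):
        """Install a new speculator and return its version number."""
        with self._lock:
            version = self._current[0] + 1
            self._current = (version, new_speculator)
            return version


# Toy speculators: each maps a token prefix to a list of drafted tokens.
def draft_v0(prefix):
    return [prefix[-1]]


def draft_v1(prefix):
    return [prefix[-1], prefix[-1]]


slot = SpeculatorSlot(draft_v0)
version_a, spec_a = slot.get()      # a request starts with version 0
new_version = slot.swap(draft_v1)   # a trainer publishes an update
version_b, spec_b = slot.get()      # later requests see version 1
```

A request that called `get()` before the swap still holds `spec_a`, so it finishes with the speculator it started with, and only subsequent requests use the update.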

Abstract

Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a…
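
Point (2) of the abstract, that acceptance rate alone does not determine end-to-end speedup, can be made concrete with the standard analytical model of speculative decoding. The sketch below is not taken from the paper. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft forward pass costs `draft_cost_ratio` of a target forward pass. The `overhead_ratio` parameter is a hypothetical addition standing in for per-step system overhead, in the same units.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes i.i.d. acceptance with probability alpha over gamma drafted
    tokens; the target model always contributes one token per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int,
                      draft_cost_ratio: float,
                      overhead_ratio: float = 0.0) -> float:
    """Estimated speedup over plain autoregressive decoding.

    Step cost is measured in target forward passes: gamma draft passes,
    one verification pass, plus an assumed fixed overhead per step.
    """
    step_cost = gamma * draft_cost_ratio + 1.0 + overhead_ratio
    return expected_tokens_per_step(alpha, gamma) / step_cost


# A light speculator with a lower acceptance rate ...
light = estimated_speedup(alpha=0.8, gamma=4, draft_cost_ratio=0.05)
# ... versus a heavier speculator with a higher acceptance rate.
heavy = estimated_speedup(alpha=0.9, gamma=4, draft_cost_ratio=0.4)
```

Under these assumed numbers the heavier speculator has the higher acceptance rate but the lower estimated speedup, because its drafting cost dominates the step. Real deployments add effects this model ignores, such as batching and kernel-level overheads, which is consistent with the abstract's claim that true speedup is only known once measured end to end.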
