$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control
Xianwei Chen, Shimin Zhang, Jibin Wu

TL;DR
f-OPD introduces a freshness-aware control framework to stabilize long-horizon on-policy distillation, balancing performance and efficiency in large language model training.
Contribution
The paper provides a theoretical decomposition of the objective discrepancy into rollout drift and supervision drift, and a novel adaptive regulation method for asynchronous on-policy distillation.
Findings
f-OPD achieves performance comparable to synchronous methods.
It maintains high throughput while stabilizing long-horizon distillation.
The framework is effective across reasoning, tool-use, and coding tasks.
Abstract
Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, capturing staleness in student rollout and teacher context, respectively. Building on this, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates stale-sample influence and constrains policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance…
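The idea of down-weighting stale buffered samples can be illustrated with a minimal sketch. This summary does not give the paper's actual freshness score, so everything below is an assumption made for illustration: the score is approximated by an exponential decay in the version lag between the current student policy and the policy that produced the rollout, and the names `BufferedSample`, `freshness`, `weighted_loss`, `tau`, and `min_freshness` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class BufferedSample:
    """A rollout waiting in the asynchronous buffer (hypothetical structure)."""
    sample_id: str
    rollout_version: int  # student policy version that generated the rollout
    loss: float           # per-sample distillation loss (placeholder value)


def freshness(sample: BufferedSample, current_version: int, tau: float = 4.0) -> float:
    """Assumed proxy for freshness: exp(-lag / tau), where lag is the version gap.

    This is NOT the score defined in the paper; it only mimics the qualitative
    behavior that fresher samples receive scores closer to 1.
    """
    lag = max(0, current_version - sample.rollout_version)
    return math.exp(-lag / tau)


def weighted_loss(
    samples: list[BufferedSample],
    current_version: int,
    tau: float = 4.0,
    min_freshness: float = 0.05,
) -> float:
    """Freshness-weighted mean loss; samples below min_freshness are dropped."""
    scored = [(s, freshness(s, current_version, tau)) for s in samples]
    kept = [(s, f) for s, f in scored if f >= min_freshness]
    if not kept:
        return 0.0  # nothing reliable enough to train on in this batch
    total = sum(f for _, f in kept)
    return sum(f * s.loss for s, f in kept) / total


if __name__ == "__main__":
    batch = [
        BufferedSample("fresh", rollout_version=10, loss=1.0),
        BufferedSample("stale", rollout_version=6, loss=3.0),
    ]
    # The weighted loss sits closer to the fresh sample's loss than the plain mean.
    print(weighted_loss(batch, current_version=10))
```

In this sketch a hard cutoff (`min_freshness`) and a soft weight are combined; whether f-OPD uses either mechanism, or constrains policy drift in a different way, cannot be determined from the summary above.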
