Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang, An Xiao, Xinghao Chen, Yunhe Wang, Hanting Chen

TL;DR
Near-Policy Distillation (NPD) accelerates on-policy distillation of autoregressive models by decoupling student generation from training. It uses sample filtering and sparse student updates to keep training stable, and reports an 8.1x speedup over on-policy baselines along with accuracy gains over supervised fine-tuning.
Contribution
The paper introduces NPD, an asynchronous distillation framework whose sample-filtering and sparse-update strategies improve the efficiency and stability of on-policy distillation for autoregressive models.
Findings
Achieves 8.1x speedup over on-policy baselines.
Outperforms supervised fine-tuning by 8.09%.
Enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%.
Abstract
Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and the -IFD filtering mechanism, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution…
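The pipeline the abstract describes (generate asynchronously, filter out extreme samples, pack the rest for SFT, refresh the generating student only occasionally) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Sample`, `keep`, `pack`, and `policy_lag` are hypothetical names, the `score` field is a placeholder for whatever quantity the paper's filter computes (its formula is not given here), and first-fit-decreasing is just one common way to pack sequences.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    tokens: list[int]      # student-generated sequence (token ids)
    policy_version: int    # which student snapshot produced it
    score: float           # placeholder filter score (assumption, not the paper's formula)


def keep(sample: Sample, low: float, high: float) -> bool:
    """Keep a sample only if its score lies inside [low, high], dropping extremes."""
    return low <= sample.score <= high


def pack(samples: list[Sample], max_len: int) -> list[list[Sample]]:
    """Greedy first-fit-decreasing packing of samples into rows of at most max_len tokens."""
    rows: list[list[Sample]] = []
    for s in sorted(samples, key=lambda x: len(x.tokens), reverse=True):
        if len(s.tokens) > max_len:
            continue  # too long for any row; skipped in this sketch
        for row in rows:
            if sum(len(x.tokens) for x in row) + len(s.tokens) <= max_len:
                row.append(s)
                break
        else:
            rows.append([s])
    return rows


def policy_lag(step: int, sync_every: int) -> int:
    """Training steps elapsed since the generator last copied the student weights,
    assuming the copy happens whenever step is a multiple of sync_every."""
    return step % sync_every


if __name__ == "__main__":
    lengths = [5, 3, 4, 2, 6]
    scores = [0.5, 0.9, 0.1, 0.6, 0.4]
    pool = [Sample(list(range(n)), policy_version=0, score=sc)
            for n, sc in zip(lengths, scores)]
    kept = [s for s in pool if keep(s, low=0.2, high=0.8)]
    for row in pack(kept, max_len=8):
        print([len(s.tokens) for s in row])
```

With `sync_every` set to a larger value, the generator refreshes less often, so generation is cheaper but the data drifts further from the current student; the filter is what the abstract relies on to keep that drift from destabilizing training.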
