Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Zequn Sun

TL;DR
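This paper introduces Asymmetric On-Policy Distillation (AOPD), which keeps positive reinforcement where the token-level advantage is positive and replaces negative reinforcement with localized divergence minimization elsewhere. It outperforms standard OPD on mathematical reasoning benchmarks and maintains higher policy entropy during training.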
Contribution
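AOPD replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement, targeting the high-variance updates, vanishing gradients, and exploration bottlenecks of standard OPD.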
Findings
AOPD consistently outperforms standard OPD on mathematical reasoning benchmarks, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively.
AOPD maintains higher policy entropy during training.
AOPD shows better retention during sequential tool-use adaptation.
Abstract
On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher…
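The asymmetric rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not specify the advantage or the divergence, so two assumptions are made here: the per-token advantage is taken to be the teacher log-probability minus the student log-probability of the sampled token, and the "localized divergence" is taken to be a full-vocabulary reverse KL at that position. The function names `log_softmax` and `asymmetric_token_loss` are hypothetical. The code computes loss values only (NumPy, no gradients).

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def asymmetric_token_loss(student_logits, teacher_logits, sampled_tokens):
    """Per-token loss values for one trajectory.

    student_logits, teacher_logits: arrays of shape (T, V).
    sampled_tokens: integer array of shape (T,), tokens sampled by the student.
    Returns (loss, positive_mask), both of shape (T,).
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    positions = np.arange(len(sampled_tokens))
    s_tok = s_logp[positions, sampled_tokens]
    t_tok = t_logp[positions, sampled_tokens]

    # ASSUMPTION: advantage = teacher log-prob minus student log-prob
    # of the sampled token. The abstract does not define it.
    advantage = t_tok - s_tok

    # Positive-advantage tokens: policy-gradient surrogate -A * log pi(token).
    # In an autodiff framework A would be treated as a constant here.
    pg_loss = -advantage * s_tok

    # Non-positive-advantage tokens: ASSUMED reverse KL(student || teacher)
    # over the whole vocabulary at that position.
    kl_loss = (np.exp(s_logp) * (s_logp - t_logp)).sum(axis=-1)

    positive = advantage > 0
    return np.where(positive, pg_loss, kl_loss), positive

# Toy example: uniform student, teacher prefers token 0 at step 0
# and token 2 at step 1; the student sampled token 0 at both steps.
student = np.zeros((2, 3))
teacher = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
tokens = np.array([0, 0])
loss, positive = asymmetric_token_loss(student, teacher, tokens)
```

In the toy example the first token receives the reinforcement term because the teacher rates it above the student, while the second token falls in the non-positive region and receives the divergence term instead of a negative push. Under these assumptions, a token where teacher and student agree exactly has zero advantage and zero divergence, so it contributes no loss.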
