Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

Nan Jia; Haojin Yang; Xing Ma; Jiesong Lian; Shuailiang Zhang; Weipeng Zhang; Ke Zeng; Xunliang Cai; Zequn Sun

arXiv:2605.06387·cs.LG·May 14, 2026

TL;DR

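This paper introduces Asymmetric On-Policy Distillation (AOPD), which targets three weaknesses in the advantage-weighted policy gradient of standard on-policy distillation (OPD): high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. On mathematical reasoning benchmarks, AOPD outperforms standard OPD and maintains higher policy entropy during training.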

Contribution

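AOPD replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions, while preserving positive reinforcement learning elsewhere. The aim is a more effective and more stable form of on-policy distillation.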

Findings

01

AOPD consistently outperforms standard OPD on mathematical reasoning benchmarks, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively.

02

AOPD maintains higher policy entropy during training.

03

AOPD shows better retention during sequential tool-use adaptation.

Abstract

On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher…
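
The asymmetric rule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several details are assumptions the abstract does not specify: the per-token advantage is taken to be the teacher log-probability minus the student log-probability of the sampled token, "localized divergence minimization" is interpreted as a per-position forward KL from teacher to student over the full vocabulary, and the names `log_softmax` and `asymmetric_token_losses` are hypothetical. The sketch only computes loss values; in a real training loop the advantage would be treated as a constant by the autodiff framework.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def asymmetric_token_losses(student_logits, teacher_logits, sampled_ids):
    """Per-token loss values for an asymmetric OPD-style objective (sketch).

    student_logits, teacher_logits: arrays of shape (T, V).
    sampled_ids: length-T integer array of tokens sampled from the student.
    Returns (losses, positive_mask), each of shape (T,).
    """
    sampled_ids = np.asarray(sampled_ids)
    s_logp = log_softmax(np.asarray(student_logits, dtype=float))
    t_logp = log_softmax(np.asarray(teacher_logits, dtype=float))

    idx = np.arange(sampled_ids.shape[0])
    s_tok = s_logp[idx, sampled_ids]  # student log-prob of each sampled token
    t_tok = t_logp[idx, sampled_ids]  # teacher log-prob of each sampled token

    # ASSUMPTION: advantage = teacher log-prob minus student log-prob.
    advantage = t_tok - s_tok

    # Positive advantage: keep the advantage-weighted policy-gradient term.
    pg_loss = -advantage * s_tok

    # Non-positive advantage: ASSUMED per-position forward KL(teacher || student)
    # in place of negative reinforcement.
    kl_loss = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)

    positive = advantage > 0
    return np.where(positive, pg_loss, kl_loss), positive


if __name__ == "__main__":
    # Toy inputs: 2 positions, vocabulary of 3 tokens.
    student = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
    teacher = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    losses, mask = asymmetric_token_losses(student, teacher, [0, 0])
    print(losses, mask)
```

Under these assumptions, a position where the student already matches the teacher has zero advantage and zero KL, so it contributes no loss, while a position where the student is overconfident relative to the teacher receives a distribution-level correction rather than a single-token penalty.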
