TL;DR
This paper presents a comprehensive empirical analysis of on-policy distillation methods for large language models, identifying their strengths and failure modes and proposing mitigation strategies.
Contribution
It systematically investigates when on-policy distillation works or fails, and introduces techniques to address identified failure mechanisms.
Findings
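On-policy distillation (OPD) on mathematical reasoning is highly sensitive to teacher choice and loss formulation.
On-policy self-distillation (OPSD) fails in the tested settings when it relies on instance-specific privileged information (PI), because that information is absent at test time.
OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference.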
Mitigation strategies include stop-gradient TopK and RLVR-adapted teachers.
Abstract
On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify…
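As a rough illustration of what "dense token-level supervision on trajectories sampled from the model's own policy" can look like, the sketch below computes a per-token divergence between student and teacher distributions along a student-sampled sequence. It assumes a reverse KL objective, KL(student || teacher), which is one common choice and not necessarily the loss formulation studied in the paper. The function names (`token_reverse_kl`, `opd_loss`) and the toy logits are hypothetical. In an OPSD setting, the teacher logits would presumably come from the same model conditioned on privileged information rather than from a separate model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at one token position.

    Assumption: reverse KL is used here for illustration only.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))

def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL along one trajectory.

    Both arguments hold one logit vector per position of a sequence that
    was sampled from the student (the on-policy part). The teacher only
    scores those positions; it does not generate the sequence.
    """
    per_token = [
        token_reverse_kl(s, t)
        for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)

# Toy example: 2 positions, vocabulary of 3 tokens (made-up numbers).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[1.5, 1.0, -0.5], [0.0, 1.0, 0.0]]
loss = opd_loss(student, teacher)
```

Every position contributes its own term to the loss, which is what makes the signal dense compared with a single sequence-level reward.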
