The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Siqi Zhu; Xuyan Ye; Hongyu Lu; Weiye Shi; Ge Liu

arXiv:2605.11182 · cs.AI · May 13, 2026

TL;DR

This paper presents a comprehensive empirical analysis of on-policy distillation methods for large language models, identifying their strengths and failure modes and proposing mitigation strategies.

Contribution

It systematically investigates when on-policy distillation works and when it fails, and introduces techniques that address the identified failure mechanisms.

Findings

01 OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation.

02 OPSD fails in the tested settings when its privileged information (PI) is instance-specific, because that PI is absent at test time.

03 Mitigation strategies include stop-gradient TopK and RLVR-adapted teachers.

Abstract

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify…
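To make "dense token-level supervision on trajectories sampled from the model's own policy" concrete, the following is a minimal sketch of the general idea, not the paper's implementation. It assumes a per-token reverse KL between student and teacher distributions as the loss; the abstract notes that loss formulation matters, and this is only one common choice. The functions `toy_student` and `toy_teacher`, the vocabulary size, and all other names are hypothetical stand-ins for real language models.

```python
import math
import random

VOCAB = 4  # hypothetical toy vocabulary size


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_token(probs, rng):
    """Draw one token index from a categorical distribution."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher); assumes strictly positive probabilities."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(student_probs, teacher_probs)
    )


def opd_token_losses(student_fn, teacher_fn, prompt, steps, rng):
    """Roll out the STUDENT policy and score every step against the teacher.

    The trajectory is sampled from the student (on-policy), and the teacher
    supplies a full distribution at each position (dense supervision).
    """
    tokens = list(prompt)
    losses = []
    for _ in range(steps):
        p = softmax(student_fn(tokens))  # student next-token distribution
        q = softmax(teacher_fn(tokens))  # teacher on the same prefix
        losses.append(reverse_kl(p, q))
        tokens.append(sample_token(p, rng))  # continue with the student's token
    return tokens, losses


def toy_student(tokens):
    # Hypothetical student: prefers repeating the last token.
    return [1.0 if i == tokens[-1] else 0.0 for i in range(VOCAB)]


def toy_teacher(tokens):
    # Hypothetical teacher: prefers the next token in cyclic order.
    return [2.0 if i == (tokens[-1] + 1) % VOCAB else 0.0 for i in range(VOCAB)]


rng = random.Random(0)
tokens, losses = opd_token_losses(toy_student, toy_teacher, [0], 5, rng)
_, self_losses = opd_token_losses(toy_student, toy_student, [0], 5, random.Random(0))
```

In a real training loop the per-token losses would be averaged and differentiated with respect to the student's parameters. In the self-distillation variant (OPSD), the teacher would be the same model conditioned on additional privileged information rather than a separate network.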

Code & Models

Repositories

ulab-uiuc/Open-On-Policy-Distillation (GitHub)