A Survey of On-Policy Distillation for Large Language Models
Mingyang Song, Mao Zheng

TL;DR
This survey reviews on-policy distillation methods for large language models, formalizing the approach, organizing existing techniques, and identifying open challenges for future research.
Contribution
It provides a unified formal framework for on-policy distillation, organizes diverse methods along key design axes, and synthesizes insights across related fields.
Findings
Formalization of OPD as divergence minimization
Organization of methods along optimization, signal source, and stabilization axes
Identification of open problems like scaling laws and uncertainty-aware feedback
Abstract
As Large Language Models (LLMs) continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become a central engineering problem, and knowledge distillation remains the dominant technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but must generate its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation (OPD) reorganizes the training loop around this observation by having the teacher provide feedback on what the student…
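To make the on-policy idea concrete, the following is a minimal sketch and not the survey's own formulation. It assumes a reverse KL divergence, KL(student || teacher), as the per-token feedback signal, and it uses a hypothetical `toy_policy` helper (a three-token, last-token-only model) as a stand-in for real language models. The point it illustrates is that the prefix is sampled from the student, so the teacher is queried on states the student actually visits.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher). The choice of reverse KL is an assumption here."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def toy_policy(weights):
    """Hypothetical stand-in for an LLM: logits depend only on the last token."""
    def next_token_probs(prefix):
        last = prefix[-1] if prefix else 0
        return softmax(weights[last])
    return next_token_probs


def on_policy_distillation_loss(student, teacher, length, rng):
    """Average per-token divergence along a trajectory sampled from the student."""
    prefix = []
    total = 0.0
    for _ in range(length):
        s = student(prefix)
        t = teacher(prefix)  # teacher scores the student's own prefix
        total += reverse_kl(s, t)
        # The next token comes from the student, not from the teacher.
        token = rng.choices(range(len(s)), weights=s, k=1)[0]
        prefix.append(token)
    return total / length


teacher = toy_policy([[2.0, 0.0, -1.0], [0.0, 2.0, 0.0], [-1.0, 0.0, 2.0]])
student = toy_policy([[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]])

rng = random.Random(0)
loss = on_policy_distillation_loss(student, teacher, length=8, rng=rng)
self_loss = on_policy_distillation_loss(teacher, teacher, length=8, rng=rng)
print(f"student vs teacher: {loss:.4f}, teacher vs itself: {self_loss:.4f}")
```

In a real training loop this quantity would be differentiated with respect to the student's parameters. Static imitation, by contrast, would evaluate the student only on teacher-generated prefixes.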
