A Survey of On-Policy Distillation for Large Language Models

Mingyang Song; Mao Zheng

arXiv:2604.00626 · cs.LG · May 19, 2026

TL;DR

This survey reviews on-policy distillation methods for large language models, formalizing the approach, organizing existing techniques, and identifying open challenges for future research.

Contribution

It provides a unified formal framework for on-policy distillation, organizes diverse methods along key design axes, and synthesizes insights across related fields.

Findings

01 Formalization of OPD as divergence minimization

02 Organization of methods along optimization, signal source, and stabilization axes

03 Identification of open problems like scaling laws and uncertainty-aware feedback

Abstract

As Large Language Models (LLMs) continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become a central engineering problem, and knowledge distillation remains the dominant technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but must generate its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation (OPD) reorganizes the training loop around this observation by having the teacher provide feedback on what the student…
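
The loop the abstract describes, in which the student generates its own prefixes and the teacher gives feedback on them, can be illustrated with a small sketch. This is a toy under stated assumptions, not the survey's formulation: the visible text says only that OPD is formalized as divergence minimization, so the choice of reverse KL here is an assumption, and `toy_teacher`, `toy_student`, and `on_policy_divergence` are hypothetical names. Models are stand-in functions that map a token prefix to a next-token distribution, and the gradient step that would update the student is omitted.

```python
import math
import random


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) over a shared vocabulary.

    Assumes teacher probabilities are strictly positive wherever the
    student places mass (an assumption of this sketch).
    """
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


def on_policy_divergence(student, teacher, length, num_rollouts, rng):
    """Average per-token divergence on prefixes sampled from the student."""
    total = 0.0
    for _ in range(num_rollouts):
        prefix = ()
        for _ in range(length):
            s_probs = student(prefix)
            # The teacher scores the student's own prefix, not a teacher-written one.
            t_probs = teacher(prefix)
            total += reverse_kl(s_probs, t_probs)
            # The next token comes from the student, which keeps the rollout on-policy.
            token = rng.choices(range(len(s_probs)), weights=s_probs)[0]
            prefix += (token,)
    return total / (num_rollouts * length)


# Hypothetical toy models over a 3-token vocabulary; they ignore the prefix.
def toy_teacher(prefix):
    return [0.7, 0.2, 0.1]


def toy_student(prefix):
    return [0.5, 0.3, 0.2]


if __name__ == "__main__":
    rng = random.Random(0)
    loss = on_policy_divergence(toy_student, toy_teacher, length=8, num_rollouts=4, rng=rng)
    print(f"mean per-token reverse KL on student rollouts: {loss:.4f}")
```

In a real setup the two functions would be language models and the returned quantity would be differentiated with respect to the student's parameters. The point of the sketch is only where the prefixes come from: off-policy imitation would evaluate the same divergence on teacher-written prefixes instead.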
