A Brief Overview: On-Policy Self-Distillation In Large Language Models
Fangming Cui, Sunan Li, Jiahong Li

TL;DR
This paper provides a concise overview of On-Policy Self-Distillation (OPSD), a framework where large language models learn by self-distillation, reducing memory use and eliminating the need for external teachers.
Contribution
It offers a beginner-friendly analysis of OPSD's conceptual foundations, methodological innovations, and design principles in large language models.
Findings
OPSD reduces GPU memory consumption by 40%-60%.
It aligns the model's reasoning behavior with solution-aware rationalizations.
It eliminates reliance on external teacher models.
Abstract
On-Policy Self-Distillation (OPSD) is a unified learning framework in which a single large language model acts simultaneously as both teacher and student. Unlike conventional knowledge distillation, which relies on a separate, often larger teacher model, OPSD runs the same model under two contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. The model is trained to minimize the per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation. OPSD typically reduces GPU memory consumption by…
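The objective described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: `toy_logits` is a hypothetical stand-in for a single model's forward pass, the vocabulary size and token count are arbitrary, and the divergence direction KL(teacher || student) is an assumption, since the abstract only says "per-token distributional divergence". The sketch shows the two properties the abstract names: one model plays both roles, differing only in context, and the trajectory is sampled from the student.

```python
import math
import random

VOCAB_SIZE = 5  # arbitrary toy vocabulary size (assumption)

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def toy_logits(context):
    """Hypothetical stand-in for one shared model: deterministic logits per context."""
    rng = random.Random(repr(context))  # string seeds are reproducible across runs
    return [rng.uniform(-2.0, 2.0) for _ in range(VOCAB_SIZE)]

def opsd_loss(problem, solution, max_tokens=4, seed=0):
    """Mean per-token KL(teacher || student) over a student-sampled trajectory."""
    sampler = random.Random(seed)
    student_ctx = (problem,)           # student sees only the problem statement
    teacher_ctx = (problem, solution)  # teacher also sees the verified solution
    generated = ()
    per_token = []
    for _ in range(max_tokens):
        # Same model (toy_logits) in both roles; only the context differs.
        p_student = softmax(toy_logits(student_ctx + generated))
        p_teacher = softmax(toy_logits(teacher_ctx + generated))
        per_token.append(kl_divergence(p_teacher, p_student))
        # On-policy: the next token is sampled from the student distribution.
        token = sampler.choices(range(VOCAB_SIZE), weights=p_student)[0]
        generated += (token,)
    return sum(per_token) / len(per_token), generated

loss, tokens = opsd_loss("What is 2 + 2?", "2 + 2 = 4")
print(f"mean per-token divergence: {loss:.4f}, sampled tokens: {tokens}")
```

In a real training loop the divergence would be backpropagated through the student role only or through both, depending on the method's design; the abstract does not specify this, so the sketch stops at computing the loss.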
