A Brief Overview: On-Policy Self-Distillation In Large Language Models

Fangming Cui; Sunan Li; Jiahong Li

arXiv:2605.18141 · cs.HC · May 22, 2026

TL;DR

This paper gives a concise overview of On-Policy Self-Distillation (OPSD), a framework in which a single large language model serves as both teacher and student, reducing GPU memory use and eliminating the need for an external teacher.

Contribution

It offers a beginner-friendly analysis of OPSD's conceptual foundations, methodological innovations, and design principles in large language models.

Findings

01

OPSD typically reduces GPU memory consumption by 40%-60%.

02

It aligns the model's own reasoning behavior with solution-aware rationalizations.

03

It eliminates reliance on external teacher models.

Abstract

On-Policy Self-Distillation (OPSD) is a unified learning framework in which a single large language model acts simultaneously as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation. OPSD typically reduces GPU memory consumption by…
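The objective described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: `toy_logits` is a hypothetical stand-in for a single shared model, tokens are plain integers, and the divergence is taken as KL(teacher || student), which is an assumption because the abstract does not state the direction. It shows only the structure: one model, two contexts (with and without the verified solution), a trajectory sampled from the student, and a per-token divergence averaged over that trajectory.

```python
import math
import random

VOCAB_SIZE = 4  # assumed toy vocabulary size


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_logits(context):
    """Hypothetical stand-in for the shared model: deterministic logits per context."""
    seed = sum((i + 1) * (tok + 1) for i, tok in enumerate(context))
    rng = random.Random(seed)
    return [rng.uniform(-2.0, 2.0) for _ in range(VOCAB_SIZE)]


def sample_student_trajectory(problem, length, rng):
    """On-policy sampling: the student sees only the problem and its own prefix."""
    trajectory = []
    for _ in range(length):
        probs = softmax(toy_logits(tuple(problem) + tuple(trajectory)))
        trajectory.append(rng.choices(range(VOCAB_SIZE), weights=probs)[0])
    return trajectory


def opsd_loss(problem, solution, trajectory):
    """Mean per-token KL(teacher || student) along a student-sampled trajectory.

    Teacher role: same model, context = problem + verified solution + prefix.
    Student role: same model, context = problem + prefix.
    """
    total = 0.0
    for t in range(len(trajectory)):
        prefix = tuple(trajectory[:t])
        teacher = softmax(toy_logits(tuple(problem) + tuple(solution) + prefix))
        student = softmax(toy_logits(tuple(problem) + prefix))
        total += kl_divergence(teacher, student)
    return total / max(len(trajectory), 1)


if __name__ == "__main__":
    rng = random.Random(0)
    problem = [1, 2, 3]   # toy problem statement tokens
    solution = [0, 3]     # toy verified reasoning trace tokens
    traj = sample_student_trajectory(problem, length=5, rng=rng)
    print("trajectory:", traj)
    print("loss:", opsd_loss(problem, solution, traj))
```

In a real training loop the student-side distribution would carry gradients while the teacher-side distribution would usually be treated as a fixed target; that detail is also an assumption here, since the toy model has no trainable parameters. Note that when the solution is empty the two contexts coincide and the loss is exactly zero, which reflects that the only difference between the roles is the privileged information.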
