Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

TL;DR
This paper introduces On-Policy Self-Distillation (OPSD), a training method in which a single large language model acts as both teacher and student. By combining privileged information with self-generated trajectories, it improves reasoning performance and efficiency.
Contribution
The paper proposes OPSD, a new on-policy self-distillation algorithm that enables a single LLM to teach itself using privileged information, reducing reliance on larger external teachers and enhancing reasoning capabilities.
Findings
OPSD outperforms off-policy distillation on reasoning benchmarks.
The method achieves higher token efficiency than reinforcement learning approaches.
Single-model self-distillation improves reasoning accuracy with less data.
Abstract
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self, we introduce On-Policy Self-Distillation (OPSD), a learning algorithm where a single LLM acts as both teacher and student with different contexts. The teacher policy conditions on…
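The dense token-level supervision described above can be illustrated with a minimal sketch. It assumes the teacher and student next-token logits are already available for each position of a trajectory the student sampled; in OPSD both would come from the same model under different contexts, but how the teacher context is built is not shown here. The function names, the toy logits, and the choice of KL(student || teacher) as the divergence are all illustrative assumptions, not details confirmed by the text.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position (divergence choice is an assumption)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(student_steps, teacher_steps):
    """Mean per-token divergence over one student-sampled trajectory.

    Each argument is a list of logit vectors, one per generated token.
    Hypothetical helper: in practice both lists would come from the same
    model, queried with the student context and the teacher context.
    """
    kls = [token_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)


# Toy logits over a 3-token vocabulary for a 2-token trajectory.
student_steps = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher_steps = [[1.0, 1.5, 0.1], [0.1, 0.9, 0.2]]
loss = self_distillation_loss(student_steps, teacher_steps)
```

The loss is zero when the two sets of logits agree at every position and grows as the distributions diverge, so every sampled token contributes a training signal rather than only the final answer.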
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
