Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao; Zhihui Xie; Mengchen Liu; Jing Huang; Guan Pang; Feiyu Chen; Aditya Grover

arXiv:2601.18734 · cs.LG · March 23, 2026

Open Access · 1 model

TL;DR

This paper introduces On-Policy Self-Distillation (OPSD), a novel training method where a single large language model acts as both teacher and student, improving reasoning performance and efficiency by leveraging privileged information and self-generated trajectories.

Contribution

The paper proposes OPSD, a new on-policy self-distillation algorithm that enables a single LLM to teach itself using privileged information, reducing reliance on larger external teachers and enhancing reasoning capabilities.

Findings

01. OPSD outperforms off-policy distillation in reasoning benchmarks.

02. The method achieves higher token efficiency than reinforcement learning approaches.

03. Single-model self-distillation improves reasoning accuracy with less data.

Abstract

Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self, we introduce On-Policy Self-Distillation (OPSD), a learning algorithm where a single LLM acts as both teacher and student with different contexts. The teacher policy conditions on…
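The setup described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract is truncated before it names the training objective, so the per-token KL divergence used here is an assumed choice, and `toy_logits`, `token_kl`, and `self_distillation_loss` are hypothetical names. The one model plays both roles: as student it sees only the problem, as teacher it sees the problem plus privileged information (for example, a ground-truth solution), and both are scored on a trajectory the student sampled itself.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: the divergence and its direction are illustrative choices.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(logits_fn, problem, privileged, sampled_tokens):
    """Average per-token divergence over a student-sampled trajectory.

    logits_fn(context, prefix) is a hypothetical stand-in for one forward
    pass of the shared model; it returns next-token logits.
    """
    student_ctx = problem               # student: problem only
    teacher_ctx = problem + privileged  # teacher: problem + privileged info
    total = 0.0
    for t in range(len(sampled_tokens)):
        prefix = sampled_tokens[:t]     # on-policy: prefix comes from the student
        s_logits = logits_fn(student_ctx, prefix)
        t_logits = logits_fn(teacher_ctx, prefix)  # treated as a fixed target
        total += token_kl(s_logits, t_logits)
    return total / len(sampled_tokens)


def toy_logits(context, prefix):
    """Hypothetical toy model: seeing the hint sharpens the logit of token 0."""
    boost = 2.0 if "hint" in context else 0.0
    return [1.0 + boost, 0.5, 0.0]


loss = self_distillation_loss(toy_logits, ["question"], ["hint"], [0, 1, 0])
print(f"dense token-level loss: {loss:.4f}")
```

When the privileged context is empty, teacher and student see the same input and the loss is zero, which makes the role of the extra context explicit. In a real training loop the teacher logits would typically be computed without gradients so that only the student pass is updated; that detail is also an assumption here.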

Code & Models

Models

🤗 JLiangHe/OPSD_exp (model)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques