Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li; Yuxin Zuo; Bingxiang He; Jinqian Zhang; Chaojun Xiao; Cheng Qian; Tianyu Yu; Huan-ang Gao; Wenkai Yang; Zhiyuan Liu; Ning Ding

arXiv:2604.13016 · cs.LG · April 16, 2026

TL;DR

This paper systematically investigates the dynamics of on-policy distillation in large language models, revealing key conditions for success and proposing strategies to improve it, while questioning its scalability for long-horizon tasks.

Contribution

The paper provides a detailed analysis of OPD mechanisms, identifies the factors that determine whether distillation succeeds, and introduces practical strategies for recovering failing runs.

Findings

01. Successful OPD requires compatible thinking patterns between teacher and student.

02. High-probability token alignment is crucial for effective distillation.

03. Off-policy cold start and teacher-aligned prompts can recover failing distillation.

Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited…
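
To make the token-level picture concrete, the following is a minimal sketch of two per-position quantities one might track on student-sampled prefixes: a reverse KL divergence, KL(student || teacher), and a top-k overlap between the two next-token distributions. Both are assumptions made for illustration. The abstract as shown here does not state the loss or the alignment metric the authors use, and the function names `reverse_kl` and `top_k_overlap` are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position of a student-sampled sequence.

    Assumption: reverse KL is used here as a common choice of on-policy
    distillation objective, not as the paper's confirmed loss. The sketch
    also assumes logits small enough that no teacher probability
    underflows to zero.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(
        ps * (math.log(ps) - math.log(pt))
        for ps, pt in zip(p_s, p_t)
        if ps > 0.0
    )


def top_k_overlap(student_logits, teacher_logits, k):
    """Fraction of the student's top-k tokens that are also in the teacher's top-k.

    Assumption: an illustrative proxy for "alignment on high-probability
    tokens"; the paper's own metric may differ.
    """
    def top_k(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(order[:k])

    return len(top_k(student_logits) & top_k(teacher_logits)) / k


# Toy logits over a 4-token vocabulary at one student-visited position.
student = [3.0, 2.0, 1.0, 0.0]
teacher = [3.0, 2.0, 0.0, 1.0]
print(reverse_kl(student, teacher), top_k_overlap(student, teacher, k=2))
```

In a real training loop these quantities would be averaged over positions in sequences sampled from the student, which is what distinguishes the on-policy setting from distillation on teacher-generated text.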

Code & Models

Repositories

thunlp/OPD
github

Datasets

lllyx/OpenThought3-Qwen3-4B
dataset · 142 downloads
