Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal, Duc N.M Hoang, Fartash Faghri, Yizhe Zhang, Minsik Cho, Mehrdad Farajtabar

TL;DR
This paper introduces a training-free diagnostic framework for on-policy distillation that measures how useful the teacher's guidance is at each token. It finds that the best configuration depends on the task and the model, so distillation setups need to be tailored rather than fixed.
Contribution
It develops a scalable, training-free method to evaluate distillation signals per token, per question, and per teacher, providing insight into when and how distillation helps or hurts.
Findings
Distillation guidance aligns better with the ideal gradient on incorrect rollouts than on correct ones.
No single distillation configuration is universally optimal.
Per-task, per-token diagnostics are crucial for effective distillation.
Abstract
On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate…
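To make the idea of comparing a distillation signal against an ideal direction concrete, here is a minimal toy sketch at a single token position. It is not the paper's estimator or its targeted-rollout algorithm. It assumes, purely for illustration, that gradients are taken with respect to the student's logits rather than its parameters, that the distillation loss is a reverse KL between student and teacher, and that the per-token success probabilities `q` are already known (in practice they would have to be estimated, for example from rollouts). All function names and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ideal_direction(p, q):
    """Ascent direction of success probability S = sum_a p[a] * q[a]
    with respect to the student's logits: dS/dz_a = p[a] * (q[a] - S).
    q[a] is an assumed success probability after emitting token a."""
    s = sum(pi * qi for pi, qi in zip(p, q))
    return [pi * (qi - s) for pi, qi in zip(p, q)]


def reverse_kl_direction(p, t):
    """Descent direction of KL(p || t) with respect to the student's logits:
    -dKL/dz_a = -p[a] * (log(p[a] / t[a]) - KL). Assumes p, t > 0."""
    log_ratio = [math.log(pi / ti) for pi, ti in zip(p, t)]
    kl = sum(pi * r for pi, r in zip(p, log_ratio))
    return [-pi * (r - kl) for pi, r in zip(p, log_ratio)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


# Hypothetical node: three candidate tokens, uniform student.
student = softmax([0.0, 0.0, 0.0])
success = [0.9, 0.1, 0.1]            # assumed per-token success probabilities
helpful_teacher = [0.8, 0.1, 0.1]    # favors the token that tends to succeed
harmful_teacher = [0.1, 0.8, 0.1]    # favors a token that tends to fail

ideal = ideal_direction(student, success)
print(cosine(ideal, reverse_kl_direction(student, helpful_teacher)))
print(cosine(ideal, reverse_kl_direction(student, harmful_teacher)))
```

In this toy setup, a positive cosine means the teacher pushes the student toward tokens that raise its chance of success at this position, and a negative cosine means the guidance points away from them. Repeating the comparison across tokens, questions, and teachers is the kind of fine-grained view the abstract describes, though the actual method may differ in every detail.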
