Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Mohammadreza Armandpour; Fatih Ilhan; David Harrison; Ajay Jaiswal; Duc N.M Hoang; Fartash Faghri; Yizhe Zhang; Minsik Cho; Mehrdad Farajtabar

arXiv:2605.10889 · cs.LG · May 12, 2026

TL;DR

This paper introduces a diagnostic framework for on-policy distillation that measures how effective the guidance is at each token. It finds that the optimal configuration varies with task and model, so distillation setups need to be tailored rather than fixed in advance.

Contribution

The paper develops a scalable, training-free method for evaluating distillation signals at the level of individual tokens, showing when and how distillation helps or hurts.

Findings

01. Distillation guidance aligns more closely with the ideal per-node gradient on incorrect rollouts.

02. No single distillation configuration is universally optimal.

03. Per-task, per-token diagnostics are crucial for effective distillation.

Abstract

On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate…
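
As a rough illustration of the diagnostic idea described in the abstract, the sketch below compares a teacher's distillation signal with an "ideal" direction at a single node. It is a simplification under stated assumptions, not the paper's algorithm: gradients are taken with respect to the student's next-token logits rather than model parameters, the distillation loss is assumed to be reverse KL, and the per-token success probabilities are assumed to have been estimated already (for example, by rollouts from each candidate token). All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ideal_logit_gradient(student_probs, success_by_token):
    """Gradient of J = sum_a p_a * V_a with respect to the logits.

    V_a is an (assumed, externally estimated) probability of eventual
    success after emitting token a. For a softmax policy the gradient
    is p_a * (V_a - J).
    """
    j = sum(p * v for p, v in zip(student_probs, success_by_token))
    return [p * (v - j) for p, v in zip(student_probs, success_by_token)]


def reverse_kl_descent_direction(student_probs, teacher_probs):
    """Negative gradient of KL(student || teacher) with respect to the logits.

    Assumes every probability is strictly positive. The gradient is
    p_a * (log(p_a / q_a) - KL), so the descent direction flips the sign.
    """
    kl = sum(p * math.log(p / q) for p, q in zip(student_probs, teacher_probs))
    return [
        p * (kl - math.log(p / q))
        for p, q in zip(student_probs, teacher_probs)
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def guidance_alignment(student_logits, teacher_probs, success_by_token):
    """Alignment score in [-1, 1] between teacher guidance and the ideal direction."""
    p = softmax(student_logits)
    ideal = ideal_logit_gradient(p, success_by_token)
    guided = reverse_kl_descent_direction(p, teacher_probs)
    return cosine(ideal, guided)
```

In this toy setting, a positive score means that following the teacher at this node would move the student toward a higher success probability, and a negative score means the teacher's signal would push it away. For instance, with a uniform student over three tokens and success estimates of 0.9, 0.1, 0.1, a teacher that concentrates on the first token scores close to 1, while a teacher that concentrates on the second token scores negative.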
