Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu; Haohuan Huang; Kaiwen Jiang; Jiacai Liu; Zhuo Jiang; Yuanheng Zhu; Dongbin Zhao

arXiv:2603.25562 · cs.LG · April 28, 2026

TL;DR

This paper analyzes the failure modes of on-policy distillation in LLM training and proposes a simple, effective fix using top-K local support matching to improve stability and performance.

Contribution

The paper introduces a top-K local support matching objective that addresses key failure modes of sampled-token on-policy distillation (OPD), improving stability and performance in LLM distillation.
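
As a rough illustration of what matching on a top-K local support could look like at a single token position, the sketch below restricts both next-token distributions to a small token set, renormalizes them there, and computes the reverse KL on that set. This listing does not spell out the objective, so the details are assumptions: the function names `log_softmax` and `topk_reverse_kl` are hypothetical, and choosing the support as the teacher's K highest-scoring tokens is a guess rather than the paper's confirmed construction.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def topk_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL(student || teacher) on a top-k token subset.

    Assumption (not taken from the paper): the subset is the teacher's
    k highest-logit tokens, and both distributions are renormalized on it.
    """
    vocab = teacher_logits.shape[-1]
    if not 1 <= k <= vocab:
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k largest teacher logits (order within the set is irrelevant).
    idx = np.argpartition(teacher_logits, -k)[-k:]
    # A softmax over the selected logits equals the full softmax
    # renormalized to the selected tokens.
    log_q = log_softmax(student_logits[idx])  # student on the subset
    log_p = log_softmax(teacher_logits[idx])  # teacher on the subset
    q = np.exp(log_q)
    return float(np.sum(q * (log_q - log_p)))
```

With `k` equal to the vocabulary size this reduces to the ordinary per-position reverse KL, and with identical logits it is zero. Unlike a single sampled token, every token in the subset contributes to the signal at each position.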

Findings

01

Identified three failure modes of sampled-token OPD: imbalance, unreliability, and mismatch.

02

Showed that the proposed top-K local support matching improves training stability.

03

Achieved a +19.8% performance gain over standard methods.

Abstract

On-policy distillation (OPD) is increasingly used in LLM post-training because it can leverage a teacher model to provide dense supervision on student rollouts. The standard implementation, however, usually reduces distribution matching to a sampled-token log-ratio, which can make the learning signal fragile on long rollouts whose prefixes drift away from the teacher's typical support. We revisit this formulation from both theoretical and implementation perspectives. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL minimization, but admits a substantially tighter worst-case variance bound; a controlled synthetic study further shows that stronger future-reward coupling increases gradient variance and destabilizes training. Empirically, we identify three failure modes of sampled-token OPD: imbalanced token-level supervision, unreliable teacher guidance on…
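
To make the "sampled-token log-ratio" in the abstract concrete, here is a minimal sketch for one token position, assuming the student distribution q and teacher distribution p are given as logit vectors. It compares the exact reverse KL, which sums over the whole vocabulary, with the single-sample quantity log q(y) - log p(y) evaluated at tokens y drawn from the student. The function names are hypothetical, and this toy setup is not the authors' training code.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def exact_reverse_kl(student_logits, teacher_logits):
    """KL(q || p) at one position, summed over the full vocabulary."""
    log_q = log_softmax(student_logits)
    log_p = log_softmax(teacher_logits)
    return float(np.sum(np.exp(log_q) * (log_q - log_p)))


def sampled_log_ratios(student_logits, teacher_logits, rng, n_samples=1):
    """Draw tokens y ~ q and return log q(y) - log p(y) for each draw.

    Each value is an unbiased one-sample estimate of the reverse KL at
    this position, but an individual value can be far from it and can
    even be negative.
    """
    log_q = log_softmax(student_logits)
    log_p = log_softmax(teacher_logits)
    tokens = rng.choice(log_q.shape[-1], size=n_samples, p=np.exp(log_q))
    return log_q[tokens] - log_p[tokens]
```

Calling `sampled_log_ratios(s, t, np.random.default_rng(0), n_samples=100_000).mean()` on two logit vectors `s` and `t` should land close to `exact_reverse_kl(s, t)`, while the spread of the individual samples gives a feel for how noisy a one-token signal is at a single position.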

Code & Models

Repositories

hhh675597/revisiting_opd (GitHub)
