TL;DR
This paper analyzes the failure modes of on-policy distillation in LLM training and proposes a simple, effective fix using top-K local support matching to improve stability and performance.
Contribution
It introduces a new top-K local support matching objective that addresses key failure modes of sampled-token OPD, enhancing stability and performance in LLM distillation.
Findings
Identifies three failure modes of sampled-token OPD: imbalanced token-level supervision, unreliable teacher guidance, and mismatch.
Shows that the proposed top-K local support matching improves training stability.
Reports a +19.8% performance gain over standard methods.
Abstract
On-policy distillation (OPD) is increasingly used in LLM post-training because it can leverage a teacher model to provide dense supervision on student rollouts. The standard implementation, however, usually reduces distribution matching to a sampled-token log-ratio, which can make the learning signal fragile on long rollouts whose prefixes drift away from the teacher's typical support. We revisit this formulation from both theoretical and implementation perspectives. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL minimization, but admits a substantially tighter worst-case variance bound; a controlled synthetic study further shows that stronger future-reward coupling increases gradient variance and destabilizes training. Empirically, we identify three failure modes of sampled-token OPD: imbalanced token-level supervision, unreliable teacher guidance on…
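To make the contrast in the abstract concrete, here is a minimal sketch of the two per-position signals: the sampled-token log-ratio and a reverse KL computed over a small top-K support. This is an illustration under stated assumptions, not the paper's implementation. In particular, the function names, the choice to take the support from the student's top-K tokens, and the renormalization of both distributions on that support are all assumptions made here; the paper's exact objective may differ.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log(sum(exp(x))) for a 1-D array."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def log_softmax(logits):
    """Log-probabilities for a 1-D array of logits."""
    return logits - logsumexp(logits)


def sampled_token_log_ratio(student_logits, teacher_logits, token):
    """Single-sample signal at one position: log p_student(y) - log p_teacher(y).

    Here y is a token sampled from the student. Averaged over y ~ p_student,
    this equals the per-position reverse KL, but a single draw can be noisy.
    """
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return ls[token] - lt[token]


def topk_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL restricted to a top-k support (hypothetical variant).

    Assumption: the support is the student's k most likely tokens, and both
    distributions are renormalized on that support before comparison.
    """
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    support = np.argsort(ls)[-k:]  # indices of the student's top-k tokens
    ls_k = ls[support] - logsumexp(ls[support])
    lt_k = lt[support] - logsumexp(lt[support])
    return float(np.sum(np.exp(ls_k) * (ls_k - lt_k)))


if __name__ == "__main__":
    student = np.array([2.0, 1.0, 0.0, -1.0])
    teacher = np.array([0.5, 1.5, 0.0, -2.0])
    for tok in range(len(student)):
        print(tok, sampled_token_log_ratio(student, teacher, tok))
    print("top-2 reverse KL:", topk_reverse_kl(student, teacher, k=2))
```

A single sampled-token log-ratio can be positive or negative depending on which token happened to be drawn, whereas the top-K term is a KL divergence between two normalized distributions and is therefore never negative. With K equal to the vocabulary size, the sketch reduces to the exact per-position reverse KL.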
