TL;DR
This paper argues that on-policy distillation (OPD) is efficient for large language models because of a form of "foresight": it establishes a stable update trajectory toward the final model early in training. It also introduces EffOPD, a method that accelerates OPD training.
Contribution
The paper uncovers the parameter-level mechanisms behind OPD's efficiency and proposes EffOPD, a simple, plug-and-play acceleration technique that triples training speed without extra modules.
Findings
OPD's efficiency stems from establishing a stable update trajectory toward the final model early in training.
EffOPD accelerates OPD by 3x without extra modules.
The dominant subspaces of OPD's updates align closely with the final update subspace early in training.
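One way to make the subspace-alignment finding concrete is the minimal sketch below. It is not the paper's measurement code: the overlap metric (mean squared cosine of the principal angles between top-k left singular subspaces), the energy ratio, the function names, and the synthetic matrices are all assumptions chosen for illustration.

```python
import numpy as np


def top_k_left_subspace(delta_w, k):
    """Orthonormal basis of the top-k left singular subspace of a weight update."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]


def subspace_overlap(delta_a, delta_b, k):
    """Mean squared cosine of principal angles between two top-k subspaces.

    Returns a value in [0, 1]; 1 means the subspaces coincide.
    This metric is an illustrative assumption, not the paper's definition.
    """
    u_a = top_k_left_subspace(delta_a, k)
    u_b = top_k_left_subspace(delta_b, k)
    return float(np.linalg.norm(u_a.T @ u_b, "fro") ** 2 / k)


def top_k_energy(delta_w, k):
    """Fraction of squared Frobenius norm carried by the top-k singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))


# Synthetic stand-ins for weight updates (hypothetical data, not real checkpoints).
rng = np.random.default_rng(0)
rank = 4
delta_final = rng.standard_normal((64, rank)) @ rng.standard_normal((rank, 32))
# An "early" update that already points along the final direction, plus noise.
delta_early = 0.3 * delta_final + 0.01 * rng.standard_normal((64, 32))
# An unrelated update used as a baseline.
delta_random = rng.standard_normal((64, 32))

overlap_early = subspace_overlap(delta_early, delta_final, rank)
overlap_random = subspace_overlap(delta_random, delta_final, rank)
print("early vs final overlap:", overlap_early)
print("random vs final overlap:", overlap_random)
print("top-k energy of early update:", top_k_energy(delta_early, rank))
```

With real checkpoints, `delta_early` and `delta_final` would be differences between an intermediate or final weight matrix and the initial one. A high overlap at an early step would be consistent with the alignment described above.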
Abstract
On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose…
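The module-allocation view can be illustrated by measuring how the total update is distributed across modules. The sketch below is an assumption-laden illustration rather than the paper's procedure: the share metric (each module's fraction of the total squared update norm), the function name, and the module names `module_a`, `module_b`, `module_c` are hypothetical.

```python
import numpy as np


def update_share_by_module(params_before, params_after):
    """Fraction of the total squared update norm attributed to each named module.

    Both arguments map module names to arrays of matching shapes.
    The metric is an illustrative assumption, not the paper's definition.
    """
    sq_norms = {
        name: float(np.sum((params_after[name] - params_before[name]) ** 2))
        for name in params_before
    }
    total = sum(sq_norms.values())
    if total == 0.0:
        # No parameter changed, so every module has a zero share.
        return {name: 0.0 for name in sq_norms}
    return {name: value / total for name, value in sq_norms.items()}


# Hypothetical toy parameters: three modules with very different update sizes.
rng = np.random.default_rng(0)
before = {name: rng.standard_normal((8, 8)) for name in ("module_a", "module_b", "module_c")}
after = {
    "module_a": before["module_a"] + 0.5 * rng.standard_normal((8, 8)),
    "module_b": before["module_b"] + 0.01 * rng.standard_normal((8, 8)),
    "module_c": before["module_c"].copy(),  # left untouched
}

shares = update_share_by_module(before, after)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(name, round(share, 4))
```

Tracking these shares across checkpoints would show whether updates concentrate on a stable subset of modules early in training, which is the behavior the abstract attributes to OPD.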
