Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, Vladimir Braverman

TL;DR
This paper identifies a failure mode in on-policy distillation where length inflation causes training instability, and proposes StableOPD, a method to stabilize training and improve performance.
Contribution
The paper introduces StableOPD, a novel framework combining divergence constraints and rollout mixture distillation to address length inflation in OPD.
Findings
StableOPD prevents truncation collapse in multiple datasets.
Training stability is significantly improved with a 7.2% average performance boost.
The method effectively mitigates repetition saturation and biased gradients.
Abstract
On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
