Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, Jieping Ye

TL;DR
This paper identifies a failure mode in on-policy distillation where dense supervision becomes ineffective, and proposes a trajectory-specific release rule that focuses supervision on teachable segments, improving distillation performance.
Contribution
It introduces the concept of local teachability collapse and a release rule that improves strong-to-weak on-policy distillation by concentrating supervision on discriminative trajectory regions.
Findings
The release rule outperforms standard full-trajectory distillation across multiple benchmarks.
Focusing supervision on teachable segments preserves model capabilities on out-of-domain tasks.
The approach improves distillation efficiency by evaluating the local utility of teacher feedback.
Abstract
On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we show that this assumption can fail in strong-to-weak OPD settings. Later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, yet they frequently lack the local contrast that makes dense feedback effective for prioritizing what the student learns. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a…
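The abstract describes OPD as dense per-token teacher feedback on the student's own rollouts, and argues for supervising only where that feedback stays discriminative. The sketch below illustrates the idea under explicit assumptions: per-token reverse KL as the dense objective (one common OPD choice), the sampled-token log-probability gap as the "advantage", and a windowed standard deviation as a stand-in for "local contrast". The release rule shown (`release_index`, with hypothetical parameters `window` and `tau`) is invented for illustration; the abstract is truncated before the paper's actual rule is described, so this is not the authors' method.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Numerically stable log-softmax over one vocabulary row.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits) -> List[float]:
    # Dense OPD signal (assumed objective): KL(student || teacher) at every
    # position of the student's own rollout.
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        ls, lt = log_softmax(s_row), log_softmax(t_row)
        kls.append(sum(math.exp(a) * (a - b) for a, b in zip(ls, lt)))
    return kls


def sampled_token_advantage(student_logits, teacher_logits, tokens) -> List[float]:
    # Hypothetical per-token teacher-student advantage:
    # log p_teacher(y_t) - log p_student(y_t) for the token the student sampled.
    return [log_softmax(t)[y] - log_softmax(s)[y]
            for s, t, y in zip(student_logits, teacher_logits, tokens)]


def release_index(adv: Sequence[float], window: int, tau: float) -> int:
    # HYPOTHETICAL release rule (not the paper's): stop supervising at the
    # first full window whose advantage spread (std) falls below tau, i.e.
    # where the signal may be non-zero but is no longer locally discriminative.
    # Assumes window >= 1.
    for t in range(len(adv) - window + 1):
        seg = adv[t:t + window]
        mu = sum(seg) / window
        std = math.sqrt(sum((x - mu) ** 2 for x in seg) / window)
        if std < tau:
            return t
    return len(adv)  # no collapse detected: supervise the whole trajectory


def prefix_mask(n: int, release: int) -> List[int]:
    # Supervise the prefix [0, release); the suffix "fades" (mask = 0).
    return [1 if t < release else 0 for t in range(n)]


def masked_mean(values: Sequence[float], mask: Sequence[int]) -> float:
    # Average the per-token loss over supervised positions only.
    kept = sum(mask)
    return sum(v * m for v, m in zip(values, mask)) / max(kept, 1)


if __name__ == "__main__":
    # Toy advantages: a contrastive prefix, then a flat tail that is
    # non-zero (0.3) but carries no local contrast.
    adv = [2.0, -1.0, 1.5, -0.5, 0.3, 0.3, 0.3, 0.3]
    release = release_index(adv, window=3, tau=0.1)
    print(release, prefix_mask(len(adv), release))
```

On the toy advantages, the first window with near-zero spread starts at position 4, so only the first four tokens stay supervised even though the tail's advantage is still positive. This is the distinction the abstract draws between a non-zero advantage and a locally discriminative one. A per-trajectory release point (rather than a fixed global cutoff) is one way to read "trajectory-specific"; the paper's own criterion may differ.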
