When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao

TL;DR
This paper introduces a position-weighted self-distillation method that improves reasoning model training by identifying and leveraging reliable teacher tokens based on trajectory-level structure.
Contribution
The paper proposes PW-OPSD, an on-policy self-distillation approach that weights tokens by position to improve reasoning accuracy without extra teacher computation.
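As a rough illustration of what weighting tokens by position can mean, the sketch below compares a uniform per-token average (every generated token counts equally) with a position-weighted average. This is a minimal sketch, not the paper's objective: the function names and the linear ramp schedule are hypothetical, and the direction of the ramp is arbitrary, since this summary does not state the actual PW-OPSD weighting function.

```python
from typing import List, Sequence


def weighted_token_loss(token_losses: Sequence[float],
                        weights: Sequence[float]) -> float:
    """Normalized weighted mean of per-token distillation losses."""
    if len(token_losses) != len(weights):
        raise ValueError("token_losses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(l * w for l, w in zip(token_losses, weights)) / total_weight


def uniform_weights(n: int) -> List[float]:
    """Equal weight at every position (the uniform baseline)."""
    return [1.0] * n


def linear_ramp_weights(n: int, start: float = 0.5, end: float = 1.5) -> List[float]:
    """HYPOTHETICAL schedule: weights move linearly from `start` to `end`.

    Chosen only for illustration; the real schedule may differ in shape
    and in direction.
    """
    if n == 1:
        return [(start + end) / 2]
    return [start + (end - start) * t / (n - 1) for t in range(n)]


# Toy per-token losses for a three-token rollout (made-up numbers).
losses = [1.0, 2.0, 3.0]
uniform = weighted_token_loss(losses, uniform_weights(len(losses)))
ramped = weighted_token_loss(losses, linear_ramp_weights(len(losses)))
```

With uniform weights the result is the plain mean of the per-token losses; any non-uniform schedule shifts the training signal toward the positions it weights more heavily.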
Findings
PW-OPSD improves reasoning accuracy on benchmark datasets.
Position scores outperform entropy-based measures in predicting token reliability.
The method generalizes well across different model sizes and architectures.
Abstract
On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an…
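The branch-viability diagnostic described in the abstract can be sketched as a small loop. This is a minimal sketch under stated assumptions: `continue_fn` and `is_correct` are hypothetical callables standing in for student generation and answer checking, tokens are represented as plain strings, and the list of alternatives is assumed to have already been recorded from the privileged-answer teacher prompt.

```python
from typing import Callable, Dict, List, Sequence


def branch_viability(
    student_prompt: List[str],
    spine_prefix: List[str],
    alternatives: Sequence[str],
    continue_fn: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
) -> Dict[str, bool]:
    """For each teacher-proposed next token, force it after the student
    prompt plus the on-policy spine prefix, let the student continue, and
    record whether the continuation recovers the correct answer."""
    results: Dict[str, bool] = {}
    for token in alternatives:
        forced_context = student_prompt + spine_prefix + [token]
        final_answer = continue_fn(forced_context)  # hypothetical student rollout
        results[token] = is_correct(final_answer)
    return results


# Toy stand-ins (assumptions, not real models): the continuation fails
# only when the forced token is "bad".
def toy_continue(context: List[str]) -> str:
    return "wrong" if context[-1] == "bad" else "42"


viability = branch_viability(
    student_prompt=["Q:", "6*7?"],
    spine_prefix=["Let", "us"],
    alternatives=["compute", "bad"],
    continue_fn=toy_continue,
    is_correct=lambda answer: answer == "42",
)
```

A position where several alternatives remain viable would correspond to benign solution diversity, while a position where most alternatives fail would correspond to non-viable uncertainty, which is the distinction the abstract says teacher entropy alone cannot make.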
