When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

Xiaogeng Liu; Xinyan Wang; Yingzi Ma; Yechao Zhang; Chaowei Xiao

arXiv:2605.21606·cs.LG·May 22, 2026

TL;DR

This paper introduces a position-weighted self-distillation method that improves reasoning model training by identifying and leveraging reliable teacher tokens based on trajectory-level structure.

Contribution

It proposes PW-OPSD, a novel on-policy self-distillation approach that weights tokens by position to enhance reasoning accuracy without extra teacher computation.

Findings

01 PW-OPSD improves reasoning accuracy on benchmark datasets.

02 Position scores outperform entropy-based measures in predicting token reliability.

03 The method generalizes well across different model sizes and architectures.

Abstract

On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an…
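
To make the contrast between uniform and non-uniform token weighting concrete, here is a minimal sketch of a per-token distillation loss along a single student rollout. It is an illustration under stated assumptions, not the paper's implementation: the KL direction (teacher to student), the function names, and in particular the linear `linear_position_weights` schedule are all hypothetical, since the excerpt above does not give the actual PW-OPSD position score.

```python
import math


def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) at one token position over a shared vocabulary.

    Assumption: the distillation target is a forward KL; the excerpt does not
    specify the divergence used.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def weighted_distill_loss(teacher, student, weights=None):
    """Weighted mean of per-token KL terms along one rollout.

    teacher, student: lists of next-token distributions, one per position.
    weights=None gives every position weight 1, i.e. the uniform objective.
    """
    kls = [token_kl(t, s) for t, s in zip(teacher, student)]
    if weights is None:
        weights = [1.0] * len(kls)
    return sum(w * k for w, k in zip(weights, kls)) / sum(weights)


def linear_position_weights(n, start=1.0, end=0.2):
    """Hypothetical schedule: interpolate linearly from start to end over n positions."""
    if n == 1:
        return [start]
    return [start + (end - start) * i / (n - 1) for i in range(n)]


# Toy rollout: 3 positions, vocabulary of size 2.
teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.6, 0.4], [0.5, 0.5], [0.5, 0.5]]

uniform_loss = weighted_distill_loss(teacher, student)
position_loss = weighted_distill_loss(
    teacher, student, linear_position_weights(len(teacher))
)
```

With `weights=None` the function reduces to the equal-weight objective described in the abstract; any position-dependent schedule simply rescales how much each token's teacher target contributes to the loss.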
