Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang

TL;DR
This paper uncovers persistent high-loss 'Rock Tokens' in on-policy distillation that do not contribute to reasoning, suggesting potential for more efficient training by bypassing these tokens.
Contribution
It reveals the existence and characteristics of Rock Tokens in OPD, challenging the assumption that all tokens should be equally optimized during training.
Findings
Rock Tokens can account for up to 18% of tokens in outputs.
Despite high frequency, Rock Tokens resist teacher corrections.
Rock Tokens have negligible impact on reasoning performance.
Abstract
While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that, as the most direct signal of student-teacher mismatch under OPD's per-token KL objective, should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a…
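To make the notion of a per-token KL loss concrete, the following is a minimal sketch of how high-loss positions in a student-generated sequence might be flagged. It is not the paper's procedure: the KL direction (student relative to teacher), the fixed `threshold`, the toy three-word vocabulary, and the function names `token_kl` and `flag_high_loss` are all assumptions made for illustration. It also inspects only a single checkpoint, whereas the abstract describes tokens whose loss stays high after training saturates.

```python
import math


def token_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) at one position over a shared vocabulary.

    Assumption: reverse KL is used here for illustration only; the
    summary above does not state which direction the objective takes.
    """
    return sum(
        s * math.log((s + eps) / (t + eps))
        for s, t in zip(student, teacher)
        if s > 0
    )


def flag_high_loss(student_dists, teacher_dists, threshold):
    """Return per-position losses, flagged indices, and the flagged fraction.

    `threshold` is a hypothetical cutoff, not a value from the paper.
    """
    losses = [token_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    flagged = [i for i, loss in enumerate(losses) if loss > threshold]
    return losses, flagged, len(flagged) / len(losses)


# Toy data: 4 positions, vocabulary of 3 tokens (made up for this example).
student_dists = [
    [0.90, 0.05, 0.05],  # agrees with the teacher
    [0.10, 0.80, 0.10],  # prefers a different token than the teacher
    [0.34, 0.33, 0.33],  # agrees with the teacher
    [0.98, 0.01, 0.01],  # strongly disagrees with the teacher
]
teacher_dists = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.34, 0.33, 0.33],
    [0.01, 0.98, 0.01],
]

losses, flagged, fraction = flag_high_loss(student_dists, teacher_dists, threshold=0.5)
print(flagged, fraction)
```

Under these assumptions, a persistence check in the spirit of the abstract would repeat this measurement across late training checkpoints and keep only the positions that stay above the cutoff.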
