TL;DR
This paper introduces TIP, a taxonomy of token importance for on-policy distillation (OPD), and shows that selecting training tokens by student entropy and teacher-student divergence improves both efficiency and performance.
Contribution
The paper proposes a two-axis taxonomy of token importance that combines student entropy with teacher-student divergence, and demonstrates its effectiveness across multiple models and tasks.
Findings
Entropy-based sampling that retains 50% of tokens matches or exceeds all-token training.
Training only on low-entropy, high-divergence tokens, which make up fewer than 10% of all tokens, nearly matches full-token training.
Q3-only training on fewer than 20% of tokens surpasses full-token OPD.
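The first finding can be illustrated with a minimal sketch of entropy-weighted token sampling. This is not the authors' implementation: the function names, the `keep_frac` parameter, and the choice to sample positions without replacement in proportion to student entropy are all assumptions made for illustration, since the summary does not specify the exact sampling rule.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at each token position."""
    z = logits - logits.max(axis=-1, keepdims=True)           # stabilise the softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # log-probabilities
    return -(np.exp(logp) * logp).sum(axis=-1)

def sample_by_entropy(student_logits, keep_frac=0.5, seed=0):
    """Return a boolean mask keeping about keep_frac of the positions.

    Assumption: positions are drawn without replacement with probability
    proportional to student entropy. The paper may use a different rule.
    """
    rng = np.random.default_rng(seed)
    h = token_entropy(student_logits)
    n = h.shape[0]
    k = max(1, int(round(keep_frac * n)))
    w = h + 1e-8                      # keep every weight strictly positive
    idx = rng.choice(n, size=k, replace=False, p=w / w.sum())
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

# Hypothetical example: 8 token positions over a 5-word vocabulary.
demo_logits = np.random.default_rng(1).normal(size=(8, 5))
demo_mask = sample_by_entropy(demo_logits, keep_frac=0.5)
```

In a training loop, a mask like `demo_mask` would multiply the per-token distillation loss so that only the retained positions contribute gradients.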
Abstract
On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token…
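The two regions described in the abstract can be sketched as a per-token mask. The following is an illustrative sketch under stated assumptions, not the paper's method: the thresholds `ent_thresh` and `div_thresh`, the function names, and the use of reverse KL (student relative to teacher) as the divergence are all hypothetical choices, since this excerpt does not state which divergence or cut-offs the authors use.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def informative_mask(student_logits, teacher_logits, ent_thresh, div_thresh):
    """Flag tokens in either of the two regions named in the abstract.

    Returns (selected, high_entropy, overconfident) boolean arrays.
    Assumptions: divergence is KL(student || teacher), and both regions
    are defined by fixed thresholds.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    s_p = np.exp(s_logp)
    entropy = -(s_p * s_logp).sum(axis=-1)               # student entropy
    divergence = (s_p * (s_logp - t_logp)).sum(axis=-1)  # KL(student || teacher)
    high_entropy = entropy >= ent_thresh
    # Low entropy but far from the teacher: confident and wrong.
    overconfident = (~high_entropy) & (divergence >= div_thresh)
    return high_entropy | overconfident, high_entropy, overconfident

# Hypothetical example with three positions over a 3-word vocabulary:
# position 0: student is uncertain; position 1: student is confident but
# disagrees with the teacher; position 2: student is confident and agrees.
student = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
teacher = np.array([[0.0, 0.0, 0.0], [0.0, 10.0, 0.0], [10.0, 0.0, 0.0]])
selected, high_ent, overconf = informative_mask(student, teacher, 0.5, 1.0)
```

With these inputs, position 0 falls in the high-entropy region, position 1 in the overconfident region, and position 2 in neither, so only the first two would receive a distillation loss.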
