TIP: Token Importance in On-Policy Distillation

Yuanda Xu; Hejian Sang; Zhengze Zhou; Ran He; Zhipeng Wang; Alborz Geramifard

arXiv:2604.14084 · cs.LG · May 22, 2026

TL;DR

This paper introduces TIP, a two-axis taxonomy of token importance in on-policy distillation built on student entropy and teacher-student divergence. Training only on tokens selected along these axes cuts memory cost while matching, and in some settings exceeding, all-token training.

Contribution

The paper proposes a two-axis taxonomy of token importance that combines student entropy with teacher-student divergence, and demonstrates its effectiveness across multiple models and tasks.

Findings

01. Retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%.

02. Training only on low-entropy, high-divergence tokens, fewer than 10% of all tokens, nearly matches full-token training.

03. Q3-only training on fewer than 20% of tokens surpasses full-token OPD.

Abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token…
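The two regions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which divergence is used, how the low-entropy and high-divergence cutoffs are set, or how entropy-based sampling is performed, so this sketch assumes KL(student || teacher), fixed hypothetical thresholds (`ent_thresh`, `div_thresh`), and a deterministic top-fraction cut on entropy in place of sampling. All function and parameter names are illustrative.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token position.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    # Shannon entropy in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); the direction is an assumption, not taken from the paper.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def select_tokens(student_logits, teacher_logits,
                  keep_frac=0.5, ent_thresh=0.5, div_thresh=1.0):
    # Thresholds and keep_frac are hypothetical values chosen for illustration.
    student = [softmax(row) for row in student_logits]
    teacher = [softmax(row) for row in teacher_logits]
    ent = [entropy(p) for p in student]
    div = [kl_divergence(s, t) for s, t in zip(student, teacher)]

    # Region 1: positions with the highest student entropy.
    k = max(1, int(keep_frac * len(ent)))
    ranked = sorted(range(len(ent)), key=lambda i: ent[i], reverse=True)
    high_entropy = set(ranked[:k])

    # Region 2: low student entropy but high teacher-student divergence,
    # i.e. the student is confident and disagrees with the teacher.
    overconfident = {i for i in range(len(ent))
                     if ent[i] < ent_thresh and div[i] > div_thresh}

    return sorted(high_entropy | overconfident), ent, div

# Toy rollout: 4 token positions, vocabulary of size 3.
student_logits = [[0, 0, 0], [10, 0, 0], [10, 0, 0], [1, 0, 0]]
teacher_logits = [[0, 0, 0], [10, 0, 0], [0, 10, 0], [1, 0, 0]]

selected, ent, div = select_tokens(student_logits, teacher_logits)
print(selected)  # [0, 2, 3]
```

In this toy rollout, positions 0 and 3 are kept because the student is uncertain there. Position 2 has near-zero entropy, so an entropy-only rule would drop it, but it is kept because the student confidently disagrees with the teacher. Position 1 is confident and correct, so it is dropped.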

Code & Models

Repositories

HJSang/OPSD_OnPolicyDistillation (GitHub)
