TL;DR
SimCT improves on-policy distillation (OPD) between models with different tokenizers by comparing teacher and student over short multi-token continuations as well as shared tokens. This recovers supervision that shared-token matching discards and improves performance on mathematical reasoning and code-generation tasks.
Contribution
The paper introduces SimCT, a method that enlarges the supervision space of on-policy distillation to recover the teacher signal lost when teacher and student use heterogeneous tokenizers.
Findings
SimCT outperforms shared-vocabulary OPD and cross-tokenizer baselines.
Recovering supervision discarded by shared-token matching improves learning.
Gains are consistent across mathematical reasoning and code-generation benchmarks.
Abstract
On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives…
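To make "continuations that both tokenizers can realize" concrete, here is a minimal Python sketch of one way such units could be found for a fixed piece of text. It is an illustration under assumptions, not the paper's implementation: it treats a unit as the shortest span between character boundaries shared by both tokenizations, and scores a unit by summing per-token log-probabilities (the chain rule). The function names `joint_units` and `unit_logprobs`, the toy token lists, and the log-probability values are all hypothetical.

```python
from itertools import accumulate


def boundaries(tokens):
    """Character offsets at which each token ends."""
    return set(accumulate(len(t) for t in tokens))


def joint_units(teacher_tokens, student_tokens):
    """Split the text into the finest spans both tokenizations can realize.

    Assumption for this sketch: a span qualifies when it starts and ends on a
    character boundary present in BOTH tokenizations, so each side covers it
    with whole tokens.
    """
    text = "".join(teacher_tokens)
    if text != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")
    cuts = sorted(boundaries(teacher_tokens) & boundaries(student_tokens))
    units, start = [], 0
    for end in cuts:
        if end > start:  # skip zero-length spans caused by empty tokens
            units.append(text[start:end])
            start = end
    return units


def unit_logprobs(tokens, token_logprobs, units):
    """Sum per-token log-probabilities inside each unit (chain rule)."""
    out, i = [], 0
    for unit in units:
        total, covered = 0.0, 0
        while covered < len(unit):
            total += token_logprobs[i]
            covered += len(tokens[i])
            i += 1
        out.append(total)
    return out


# Hypothetical tokenizations of the same string by two models.
teacher = ["token", "izer", "s"]
student = ["tok", "enizer", "s"]

units = joint_units(teacher, student)
teacher_scores = unit_logprobs(teacher, [-0.5, -0.25, -1.0], units)
student_scores = unit_logprobs(student, [-1.5, -0.5, -2.0], units)
```

In this toy case only the final token "s" is shared, so exact shared-token matching would supervise one position. The multi-token unit "tokenizer" is realizable by both sides (two tokens each), which gives a second point of comparison. How SimCT actually builds the candidate continuations and feeds them into the OPD loss is described in the paper, not here.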
