SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

Jie Sun; Mao Zheng; Mingyang Song; Qiyong Zhong; Yilin Cheng; Bichuan Feng; Pengfei Liu; Junfeng Fang; Xiang Wang

arXiv:2605.07711·cs.CL·May 22, 2026

TL;DR

SimCT improves on-policy distillation between models with different tokenizers by comparing teacher and student over short multi-token continuations in addition to shared tokens. This recovers supervision that shared-token matching discards and improves performance on mathematical reasoning and code-generation tasks.

Contribution

The paper introduces SimCT, a method that enlarges the supervision space in on-policy distillation to recover the teacher signal lost when teacher and student use heterogeneous tokenizers.

Findings

01 SimCT outperforms both shared-vocabulary OPD and cross-tokenizer baselines.

02 Recovering the supervision that shared-token matching discards improves learning.

03 Gains are consistent across mathematical reasoning and code-generation benchmarks.

Abstract

On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives…
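To make the idea of "continuations that both tokenizers can realize" concrete, here is a minimal sketch, not the authors' implementation. It assumes tokens are plain strings that concatenate back to the same text, and it treats a jointly tokenizable unit as the span between two consecutive character offsets where both tokenizations have a boundary. The names `boundaries` and `joint_units` are hypothetical and do not come from the paper or its repository.

```python
from itertools import accumulate


def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    return set(accumulate(len(t) for t in tokens))


def joint_units(teacher_tokens, student_tokens):
    """Split the text at offsets where BOTH tokenizations have a boundary.

    Assumption for illustration: each token is a plain string and the
    tokens of each model concatenate to exactly the same text.
    """
    text = "".join(teacher_tokens)
    if text != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")

    shared = sorted(boundaries(teacher_tokens) & boundaries(student_tokens))
    units, start = [], 0
    for end in shared:
        units.append(text[start:end])
        start = end
    return units


# Toy tokenizations of the same word (made up for this example).
teacher = ["un", "believ", "able"]
student = ["unb", "eliev", "able"]
print(joint_units(teacher, student))
```

In this toy case only "able" is a token in both sequences, so exact shared-token matching would supervise that one position. Splitting at common boundaries also yields "unbeliev", a span each model spells with two tokens, which is the kind of multi-token unit over which teacher and student could still be compared.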

Code & Models

Repositories

sunjie279/SimCT-
github