X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

Sharath Turuvekere Sreenivas; Adithyakrishna Venkatesh Hanasoge; Mingyu Yang; Ali Taghibakhshi; Saurav Muralidharan; Ashwath Aithal; Pavlo Molchanov

arXiv:2605.21699 · cs.LG · May 22, 2026

TL;DR

X-Token is a projection-guided cross-tokenizer knowledge distillation method that targets token misalignment between teacher and student vocabularies. It outperforms state-of-the-art methods by +3.82 points with a Qwen3-4B teacher.

Contribution

X-Token proposes two complementary loss functions, P-KL and H-KL, that use a sparse projection matrix to transfer knowledge across incompatible vocabularies.
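
As a rough illustration of the role a sparse projection matrix can play, the sketch below moves a teacher distribution into a student vocabulary and then scores the student with a KL term. This page does not define P-KL or H-KL, so this is a generic projected-KL example and not the paper's loss. The function names, the dict-of-dicts matrix layout, the toy tokens, and the weights are all assumptions made for illustration.

```python
import math

def project(teacher_probs, projection):
    """Map a teacher distribution into the student vocabulary.

    `projection` is a sparse matrix stored as
    {teacher_token: {student_token: weight}}. Each row is assumed to sum
    to 1 so that probability mass is conserved (an assumption of this sketch).
    """
    student_side = {}
    for t_tok, p in teacher_probs.items():
        for s_tok, w in projection.get(t_tok, {}).items():
            student_side[s_tok] = student_side.get(s_tok, 0.0) + p * w
    return student_side

def kl_divergence(target, student, eps=1e-12):
    """KL(target || student), summed over the support of `target`."""
    return sum(
        p * math.log(p / max(student.get(tok, 0.0), eps))
        for tok, p in target.items()
        if p > 0.0
    )

# Hypothetical toy data: the teacher only has the single digit "1",
# while the student also has the multi-digit token "11".
teacher_probs = {"1": 0.6, "the": 0.4}
projection = {
    "1": {"1": 0.5, "11": 0.5},   # made-up weights, not from the paper
    "the": {"the": 1.0},
}
student_probs = {"1": 0.3, "11": 0.3, "the": 0.4}

target = project(teacher_probs, projection)
loss = kl_divergence(target, student_probs)
```

In this toy setup the projected target places mass on the student-only token "11", so a student that predicts it is not penalized for doing so.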

Findings

01. X-Token outperforms state-of-the-art methods by +3.82 points with a Qwen3-4B teacher.

02. A two-teacher setup improves performance by +1.3 points over single-teacher distillation.

03. X-Token addresses the uncommon-token failure, in which critical tokens that fall into the unmatched subset are suppressed during training.

Abstract

Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative…
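
The uncommon-token failure described in (i) can be pictured with a toy vocabulary partition. The sketch below assumes the partition is made by exact token-string match, which this page does not specify, and uses small hypothetical vocabularies in place of the real Llama and Qwen ones.

```python
def partition_vocab(student_vocab, teacher_vocab):
    """Split student tokens by exact string match against the teacher vocabulary."""
    teacher_set = set(teacher_vocab)
    matched = [tok for tok in student_vocab if tok in teacher_set]
    unmatched = [tok for tok in student_vocab if tok not in teacher_set]
    return matched, unmatched

# Hypothetical toy vocabularies (not the real tokenizers):
# the student keeps multi-digit numerals as single tokens,
# while the teacher splits numbers into individual digits.
student_vocab = ["the", "7", "42", "365", "cat"]
teacher_vocab = ["the", "cat"] + [str(d) for d in range(10)]

matched, unmatched = partition_vocab(student_vocab, teacher_vocab)
print(matched)    # ['the', '7', 'cat']
print(unmatched)  # ['42', '365']
```

Every multi-digit numeral lands in the unmatched subset, which is the group the abstract says gets suppressed during training.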
