Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models
Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li

TL;DR
This paper introduces MultiLevelOT, an optimal transport-based method for universal knowledge distillation that aligns teacher and student logit distributions at both the token and sequence levels, so distillation works across models with different tokenizers.
Contribution
The paper proposes MultiLevelOT, an optimal transport approach that enables cross-tokenizer knowledge distillation without requiring identical vocabularies and that applies across diverse language model architectures.
Findings
Outperforms state-of-the-art cross-tokenizer KD methods.
Robust across different model families and sizes.
Effective on tasks like QA and summarization.
Abstract
Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently…
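The abstract describes aligning teacher and student logit distributions with optimal transport, without requiring the two vocabularies to have the same size. As a rough illustration of that general idea only, and not the paper's actual loss, the sketch below computes an entropic-regularized OT cost with Sinkhorn iterations between two distributions of different lengths. The cost matrix (absolute difference between logit values), the regularization strength `eps`, and all function names are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def abs_diff_cost(teacher_logits, student_logits):
    """Hypothetical cost matrix: |teacher logit - student logit| per pair.

    This choice is an assumption for illustration; it needs no mapping
    between the two vocabularies.
    """
    return [[abs(t - s) for s in student_logits] for t in teacher_logits]


def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Entropic-regularized OT between histograms a (len n) and b (len m).

    Returns the transport cost and the transport plan (n x m list of lists).
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total_cost = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total_cost, plan


# Toy logits: the teacher has 5 vocabulary entries, the student has 3.
teacher_logits = [2.0, 1.0, 0.5, -1.0, 0.0]
student_logits = [1.5, 0.2, -0.5]

ot_cost, plan = sinkhorn(
    softmax(teacher_logits),
    softmax(student_logits),
    abs_diff_cost(teacher_logits, student_logits),
)
print(f"OT cost between teacher and student distributions: {ot_cost:.4f}")
```

In a distillation setting, a quantity like `ot_cost` would typically be minimized with respect to the student's parameters using an autodiff framework; the pure-Python version here only shows that the transport problem is well defined when the two distributions have different dimensions.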
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Robotics and Automated Systems
