Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Xiao Cui; Mo Zhu; Yulei Qin; Liang Xie; Wengang Zhou; Houqiang Li

arXiv:2412.14528·cs.CL·January 22, 2025

TL;DR

This paper introduces MultiLevelOT, an optimal transport-based method for cross-tokenizer knowledge distillation. It aligns teacher and student logit distributions at both the token and sequence levels, so the two models do not need to share a tokenizer.

Contribution

The paper proposes MultiLevelOT, a new optimal transport approach that enables cross-tokenizer knowledge distillation without requiring identical vocabularies, applicable across diverse language model architectures.

Findings

01 Outperforms state-of-the-art cross-tokenizer KD methods.

02 Robust across different model families and sizes.

03 Effective on tasks like QA and summarization.

Abstract

Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently…
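
To make the idea of aligning distributions of different dimension concrete, here is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between a 6-entry "teacher" distribution and a 4-entry "student" distribution. It is a generic illustration, not the paper's implementation: the logit values, the rank-based cost matrix, the regularization strength `eps`, and the function names `softmax` and `sinkhorn_plan` are all assumptions chosen for the example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def sinkhorn_plan(a, b, cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between histograms a (n,) and b (m,).

    Generic Sinkhorn iterations; eps and n_iters are illustrative defaults.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel, shape (n, m)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


# Hypothetical logits from models with different vocabulary sizes (6 vs. 4).
teacher_logits = np.array([2.0, 0.5, -1.0, 1.2, 0.0, -0.3])
student_logits = np.array([1.5, -0.5, 0.3, 0.9])

# Sort probabilities in descending order so that no token-to-token
# mapping between the two vocabularies is needed (an assumption here).
a = np.sort(softmax(teacher_logits))[::-1]
b = np.sort(softmax(student_logits))[::-1]

# Assumed cost: distance between normalized rank positions.
cost = np.abs(np.linspace(0, 1, a.size)[:, None]
              - np.linspace(0, 1, b.size)[None, :])

plan = sinkhorn_plan(a, b, cost)
ot_cost = float((plan * cost).sum())  # transport cost usable as a loss term
print(plan.shape, ot_cost)
```

The transport plan has shape (6, 4), so the two distributions never need to share a dimension; swapping in a different cost matrix changes what "close" means without changing the solver.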

Code & Models

Repositories

2018cx/multi-level-ot
PyTorch · Official

Videos

Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models · Underline

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Robotics and Automated Systems