TokAlign: Efficient Vocabulary Adaptation via Token Alignment

Chong Li; Jiajun Zhang; Chengqing Zong

arXiv:2506.03523·cs.CL·June 5, 2025

TL;DR

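TokAlign replaces an LLM's vocabulary by aligning source and target token IDs from a token co-occurrence view. This improves multilingual text compression, gives a better initialization for the new vocabulary, enables token-level knowledge transfer between models, and restores model performance in as few as 5,000 fine-tuning steps.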

Contribution

The paper introduces TokAlign, a novel token alignment technique that efficiently replaces LLM vocabularies, enhancing multilingual capabilities and token-level knowledge transfer.

Findings

01. Reduces perplexity from 340 to 120 after initialization.

02. Restores model performance in as few as 5,000 steps.

03. Token-level distillation improves on sentence-level distillation by 4.4%.

Abstract

Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, an inefficient tokenizer slows down both the training and the generation of LLMs. Vocabulary mismatch also hinders deep knowledge transfer between LLMs, such as token-level distillation. To mitigate this gap, we propose an efficient method named TokAlign that replaces the vocabulary of an LLM from the token co-occurrence view and further transfers token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are then rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from $3.4 \times 10^{2}$ of…
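The alignment and rearrangement steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each token already has a small vector derived from co-occurrence statistics, and it assigns every target token ID the source token ID with the highest cosine similarity. The function names `align_vocab` and `rearrange_embeddings`, the toy vectors, and the nearest-neighbour rule (which may reuse a source ID) are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_vocab(source_vecs, target_vecs):
    """Return mapping[target_id] = source_id of the most similar source token.

    Assumption of this sketch: similarity is plain cosine over toy
    co-occurrence vectors, and a source ID may be picked more than once.
    """
    mapping = []
    for t in target_vecs:
        best = max(range(len(source_vecs)), key=lambda s: cosine(source_vecs[s], t))
        mapping.append(best)
    return mapping


def rearrange_embeddings(source_embeddings, mapping):
    """Build the new embedding table by copying the mapped source rows."""
    return [list(source_embeddings[s]) for s in mapping]


# Toy co-occurrence vectors (hypothetical data, 3 tokens per vocabulary).
source_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_vecs = [[0.1, 0.9], [0.9, 0.8], [2.0, 0.1]]

# Toy source-model embedding rows, indexed by source token ID.
source_embeddings = [[10.0, 11.0], [20.0, 21.0], [30.0, 31.0]]

mapping = align_vocab(source_vecs, target_vecs)
new_embeddings = rearrange_embeddings(source_embeddings, mapping)
print(mapping)
print(new_embeddings)
```

In this toy setup the rearranged table only serves as an initialization for the new vocabulary; the abstract states that the parameters are fine-tuned progressively afterwards, which the sketch does not cover.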

Code & Models

Repositories

znlp/tokalign (PyTorch, official)

Videos

TokAlign: Efficient Vocabulary Adaptation via Token Alignment (Underline)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: Balanced Selection