TokAlign: Efficient Vocabulary Adaptation via Token Alignment
Chong Li, Jiajun Zhang, Chengqing Zong

TL;DR
TokAlign is an efficient vocabulary adaptation method for LLMs. It aligns source and target tokens by their co-occurrence statistics, which improves multilingual text compression, enables token-level knowledge transfer between models, and restores performance after only a few thousand fine-tuning steps.
Contribution
The paper introduces TokAlign, a novel token alignment technique that efficiently replaces LLM vocabularies, enhancing multilingual capabilities and token-level knowledge transfer.
Findings
Reduces perplexity from 340 to 120 after initialization.
Restores model performance in as few as 5,000 steps.
After vocabularies are unified, token-level distillation improves the base model by 4.4% more than sentence-level distillation.
Abstract
Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, the inefficiency of the tokenizer will slow down the training and generation of LLM. The mismatch in vocabulary also hinders deep knowledge transfer between LLMs like token-level distillation. To mitigate this gap, we propose an efficient method named TokAlign to replace the vocabulary of LLM from the token co-occurrences view, and further transfer the token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from 3.4 of…
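The alignment and rearrangement steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each token in both vocabularies already has a co-occurrence-based vector in a shared space, and it picks, for every target token, the most similar source token by cosine similarity. The names `align_vocab` and `rearrange_embeddings` are hypothetical, and the greedy argmax does not stop two target tokens from mapping to the same source token.

```python
import numpy as np

def align_vocab(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Return, for each target token ID, the ID of its closest source token.

    Assumption: src_vecs (n_src, d) and tgt_vecs (n_tgt, d) are co-occurrence
    vectors that live in the same space, so cosine similarity is meaningful.
    """
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = tgt @ src.T              # cosine similarity, shape (n_tgt, n_src)
    return sim.argmax(axis=1)      # mapping[t] = aligned source token ID

def rearrange_embeddings(src_embed: np.ndarray, mapping: np.ndarray) -> np.ndarray:
    """Initialize the new embedding table by copying the aligned source rows."""
    return src_embed[mapping].copy()

# Toy example: 3 source tokens, 3 target tokens, 2-dimensional vectors.
src_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt_vecs = np.array([[0.1, 2.0], [3.0, 3.1], [2.0, 0.1]])
src_embed = np.arange(12, dtype=float).reshape(3, 4)  # stand-in model embeddings

mapping = align_vocab(src_vecs, tgt_vecs)
new_embed = rearrange_embeddings(src_embed, mapping)
```

In this sketch the new embedding table is only an initialization. The abstract states that the rearranged parameters are then progressively fine-tuned for the new vocabulary, which the sketch does not cover.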
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: Balanced Selection
