TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
Chong Li, Yingzhuo Deng, Wen Yang, Jiajun Zhang, Chengqing Zong

TL;DR
TokAlign++ improves vocabulary adaptation in large language models by learning better token alignment, which raises multilingual compression and restores model performance with minimal fine-tuning.
Contribution
The paper introduces TokAlign++, a novel method for improving vocabulary adaptation through bilingual token alignment learned from monolingual representations.
Findings
Boosts multilingual text compression rates.
Restores the vanilla model's performance in as few as 1k fine-tuning steps.
Improves token-level distillation effectiveness.
Abstract
Tokenization is a foundational step in the text processing of Large Language Models (LLMs). Text must first be tokenized into token IDs, which are then fed to the LLM. Inefficient tokenization produces long token-ID sequences and slows down both training and inference. Fine-grained knowledge transfer between LLMs, such as token-level distillation, is also impeded by vocabulary mismatch. To bridge this gap, we introduce TokAlign++, a method that improves vocabulary adaptation by learning a better token alignment lexicon. The source and target vocabularies are treated as two different languages, and a bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged according to this lexicon for the new vocabulary and then progressively fine-tuned for adaptation. Experimental results on 15 languages show that…
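The parameter rearrangement step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the alignment lexicon is already available as a plain mapping from target token ID to source token ID, and it initializes each target embedding row by copying the aligned source row. The function name `rearrange_embeddings`, the dictionary format, and the mean-row fallback for unaligned tokens are all hypothetical choices made for illustration, not details taken from the paper.

```python
def rearrange_embeddings(source_emb, alignment, target_vocab_size):
    """Build a target embedding table by copying aligned source rows.

    source_emb: list of rows (lists of floats), one per source token ID.
    alignment:  dict mapping target token ID -> source token ID
                (hypothetical format for the alignment lexicon).
    """
    dim = len(source_emb[0])
    # Assumption: target tokens without an aligned source token fall back
    # to the mean of all source rows. The paper may handle this differently.
    mean_row = [
        sum(row[j] for row in source_emb) / len(source_emb)
        for j in range(dim)
    ]
    target_emb = []
    for tgt_id in range(target_vocab_size):
        src_id = alignment.get(tgt_id)
        if src_id is None:
            target_emb.append(list(mean_row))
        else:
            # Copy so later fine-tuning of one table cannot alias the other.
            target_emb.append(list(source_emb[src_id]))
    return target_emb


# Toy example: 3 source tokens, 3 target tokens, 2-dimensional embeddings.
source_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
alignment = {0: 2, 1: 0}  # target ID -> source ID; target ID 2 is unaligned
target_emb = rearrange_embeddings(source_emb, alignment, target_vocab_size=3)
```

In a real model the same reindexing would presumably also apply to the output projection (LM head), and the rearranged parameters would then be fine-tuned as the abstract describes.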
