Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation
François Remy, Pieter Delobelle, Bettina Berendt, Kris Demuynck, Thomas Demeester

TL;DR
This paper introduces a novel token mapping strategy to adapt high-resource language models to low-resource languages, significantly improving initialization and performance with less data and training time.
Contribution
The paper proposes a new model conversion method using a word translation dictionary to improve embedding initialization for low-resource languages.
Findings
Achieved state-of-the-art results on Dutch and Frisian tasks.
Reduced training data and time needed for effective language adaptation.
Demonstrated effectiveness across multiple downstream tasks.
Abstract
Training monolingual language models for low- and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resource monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer. This one-to-many token mapping tremendously improves the initialization of the embedding table for the target language. We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian. These converted models achieve new state-of-the-art performance on these languages across a wide range of downstream tasks. By reducing significantly…
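To make the one-to-many mapping described in the abstract more concrete, here is a minimal sketch of how such a mapping could be used to initialize a target embedding table. It assumes the mapping is already available as weighted (source token, weight) pairs and that each target embedding is the normalized weighted average of its mapped source embeddings. The function name `init_target_embeddings`, the weights, the fallback vector, and the toy vocabulary are all hypothetical, and the paper's actual procedure for deriving the mapping from the translation dictionary is not reproduced here.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def init_target_embeddings(
    source_embeddings: Dict[str, Vector],
    token_mapping: Dict[str, List[Tuple[str, float]]],
    fallback: Vector,
) -> Dict[str, Vector]:
    """Initialize each target token as a weighted mean of its mapped source tokens.

    Hypothetical helper: the weighting scheme and the fallback are assumptions
    made for illustration, not the paper's exact method.
    """
    target_embeddings: Dict[str, Vector] = {}
    for target_token, candidates in token_mapping.items():
        # Keep only candidates that exist in the source vocabulary.
        known = [(s, w) for s, w in candidates if s in source_embeddings and w > 0]
        total = sum(w for _, w in known)
        if total == 0:
            # No usable mapping: fall back to a default vector (assumption).
            target_embeddings[target_token] = list(fallback)
            continue
        vec = [0.0] * len(fallback)
        for s, w in known:
            for i, x in enumerate(source_embeddings[s]):
                vec[i] += (w / total) * x  # normalized weighted sum
        target_embeddings[target_token] = vec
    return target_embeddings


# Toy English-to-Dutch example with made-up 2-dimensional embeddings.
source_embeddings = {"house": [1.0, 0.0], "home": [0.0, 1.0], "cat": [2.0, 2.0]}
token_mapping = {
    "huis": [("house", 3.0), ("home", 1.0)],  # one-to-many mapping
    "kat": [("cat", 1.0)],
    "xyz": [],  # unmapped token, receives the fallback vector
}
target = init_target_embeddings(source_embeddings, token_mapping, [1.0, 1.0])
print(target["huis"])  # [0.75, 0.25]
```

In this sketch, a target token with several plausible source counterparts starts from a blend of their embeddings rather than from a random vector, which is the intuition behind the improved initialization the abstract reports.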
Code & Models
- 🤗 DTAI-KULeuven/robbert-2023-dutch-large · model · 2.5k dl · ♡ 21
- 🤗 DTAI-KULeuven/robbert-2023-dutch-base · model · 14k dl · ♡ 7
- 🤗 ChocoLlama/ChocoLlama-2-7B-tokentrans-base · model · 5 dl
- 🤗 ChocoLlama/ChocoLlama-2-7B-base · model · 8 dl · ♡ 2
- 🤗 ChocoLlama/ChocoLlama-2-7B-instruct · model · 8 dl · ♡ 2
- 🤗 ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct · model · 14 dl · ♡ 1
- 🤗 ChocoLlama/Llama-3-ChocoLlama-8B-base · model · 8 dl · ♡ 1
- 🤗 ChocoLlama/Llama-3-ChocoLlama-8B-instruct · model · 11 dl · ♡ 6
- 🤗 RichardErkhov/ChocoLlama_-_ChocoLlama-2-7B-instruct-8bits · model · 2 dl
- 🤗 RichardErkhov/ChocoLlama_-_ChocoLlama-2-7B-base-8bits · model · 3 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
