Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP
François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester

TL;DR
This paper introduces trans-tokenization, a cross-lingual vocabulary transfer method that adapts language models from high-resource languages to low-resource ones, supporting NLP tasks with little training data and without high-quality parallel corpora.
Contribution
The study presents a novel trans-tokenization strategy and Hydra LLMs, facilitating language adaptation and zero-shot translation for low-resource languages without high-quality parallel data.
Findings
Competitive performance on downstream tasks across diverse languages
State-of-the-art zero-shot Tatar translation model
Reduced data and training time requirements for low-resource languages
Abstract
The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we…
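The embedding initialization described in the abstract can be illustrated with a minimal sketch. It assumes the translation resource has already been reduced to a table of non-negative weights between target and source tokens (for example, alignment counts). The function name `init_target_embeddings`, the toy tokens, and the fallback to the mean source embedding for unaligned tokens are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def init_target_embeddings(
    source_embeddings: Dict[str, Vector],
    alignment: Dict[str, Dict[str, float]],
) -> Dict[str, Vector]:
    """Initialize each target token as a weighted average of source embeddings.

    `alignment` maps a target token to {source token: non-negative weight}.
    The weights are hypothetical alignment scores; they are normalized here.
    """
    dim = len(next(iter(source_embeddings.values())))
    n_source = len(source_embeddings)

    # Assumed fallback for target tokens with no usable alignment:
    # the mean of all source embeddings.
    mean = [
        sum(vec[i] for vec in source_embeddings.values()) / n_source
        for i in range(dim)
    ]

    target: Dict[str, Vector] = {}
    for tgt_token, weights in alignment.items():
        # Keep only weights that point at tokens the source model knows.
        known = {
            src: w for src, w in weights.items()
            if src in source_embeddings and w > 0
        }
        total = sum(known.values())
        if total == 0:
            target[tgt_token] = list(mean)
            continue
        target[tgt_token] = [
            sum(w * source_embeddings[src][i] for src, w in known.items()) / total
            for i in range(dim)
        ]
    return target


# Toy example with 2-dimensional embeddings (all values are made up).
source = {"house": [1.0, 0.0], "home": [0.0, 1.0], "cat": [1.0, 1.0]}
alignment = {
    "huis": {"house": 3.0, "home": 1.0},  # mostly aligned with "house"
    "kat": {"cat": 2.0},
    "xyz": {},                            # no alignment: uses the fallback
}
embeddings = init_target_embeddings(source, alignment)
```

In this toy setup, "huis" ends up three quarters of the way toward "house", since its weights are normalized to 0.75 and 0.25. In a real model the same weighted average would be applied to rows of the embedding matrix rather than to Python lists.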
Code & Models
- 🤗 Tweeties/tweety-7b-dutch-v24a · model · 22 dl · ♡ 15
- 🤗 Tweeties/tweety-7b-tatar-v24a · model · 18 dl · ♡ 12
- 🤗 Tweeties/tweety-tatar-hydra-base-7b-v24a · model · 6 dl
- 🤗 Tweeties/tweety-tatar-hydra-mt-7b-v24a · model · 8 dl
- 🤗 Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-FR · model · 11 dl · ♡ 6
- 🤗 Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-NL · model · 6 dl · ♡ 7
- 🤗 Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-DE · model · 7 dl · ♡ 5
- 🤗 Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-EN · model · 3 dl · ♡ 2
- 🤗 Parallia/Fairly-Multilingual-ModernBERT-Embed-BE · model · 42 dl · ♡ 27
- 🤗 NAMAA-Space/AraModernBert-Base-V1.0 · model · 97 dl · ♡ 14
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Sparse Evolutionary Training · Hydra
