Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

François Remy; Pieter Delobelle; Hayastan Avetisyan; Alfiya Khabibullina; Miryam de Lhoneux; Thomas Demeester

arXiv:2408.04303·cs.CL·August 9, 2024·2 cites

Open Access · 1 Repo · 10 Models

TL;DR

This paper introduces trans-tokenization, a cross-lingual vocabulary transfer method that adapts LLMs trained on high-resource languages to low-resource languages, supporting downstream NLP tasks with less training data and without high-quality parallel data.

Contribution

The study presents a novel trans-tokenization strategy and Hydra LLMs, facilitating language adaptation and zero-shot translation for low-resource languages without high-quality parallel data.

Findings

01. Competitive performance on downstream tasks across diverse languages

02. State-of-the-art zero-shot Tatar translation model

03. Reduced data and training time requirements for low-resource languages

Abstract

The development of monolingual language models for low- and mid-resource languages continues to be hindered by the difficulty of sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we…
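The embedding initialization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the API of the `lagom-nlp/transtokenizer` repository. The function name `init_target_embeddings`, the `alignments` dictionary (target token id mapped to weighted source token ids), and the fallback to the mean source embedding for unaligned tokens are all assumptions made for illustration.

```python
import numpy as np


def init_target_embeddings(src_emb, alignments, n_target):
    """Initialize target-language token embeddings from source embeddings.

    src_emb:    (V_src, d) array of source-language token embeddings.
    alignments: dict mapping a target token id to a list of
                (source_token_id, weight) pairs (hypothetical format).
    n_target:   size of the target vocabulary.
    """
    # Assumption: tokens without any alignment start at the mean source embedding.
    fallback = src_emb.mean(axis=0)
    tgt_emb = np.tile(fallback, (n_target, 1))

    for tgt_id, pairs in alignments.items():
        if not pairs:
            continue
        ids = np.array([src_id for src_id, _ in pairs])
        weights = np.array([w for _, w in pairs], dtype=float)
        total = weights.sum()
        if total <= 0:
            continue
        # Weighted average of the aligned source embeddings.
        tgt_emb[tgt_id] = (weights / total) @ src_emb[ids]
    return tgt_emb


# Toy example: 3 source tokens, 3 target tokens, 2-dimensional embeddings.
src_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alignments = {0: [(0, 3.0), (1, 1.0)], 1: [(2, 1.0)]}
tgt_emb = init_target_embeddings(src_emb, alignments, n_target=3)
```

In this toy setup, target token 0 receives three quarters of source token 0 and one quarter of source token 1, target token 1 copies source token 2, and target token 2 (unaligned) receives the fallback vector. How the weights are derived from a translation resource is outside the scope of this sketch.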

Code & Models

Repositories

lagom-nlp/transtokenizer (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Sparse Evolutionary Training · Hydra