Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation

François Remy; Pieter Delobelle; Bettina Berendt; Kris Demuynck; Thomas Demeester

arXiv:2310.03477 · cs.CL · October 6, 2023

TL;DR

This paper introduces a token mapping strategy for adapting high-resource language models to mid- and low-resource languages, improving embedding initialization and downstream performance while requiring less data and training time.

Contribution

The paper proposes a new model conversion method using a word translation dictionary to improve embedding initialization for low-resource languages.

Findings

01. Achieved state-of-the-art results on Dutch and Frisian tasks.

02. Reduced training data and time needed for effective language adaptation.

03. Demonstrated effectiveness across multiple downstream tasks.

Abstract

Training monolingual language models for low- and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resource monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer. This one-to-many token mapping greatly improves the initialization of the embedding table for the target language. We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian. These converted models achieve new state-of-the-art performance on these languages across a variety of downstream tasks. By reducing significantly…
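To make the idea of a one-to-many token mapping concrete, here is a minimal sketch of how such a mapping could seed a target-language embedding table. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the mapped source embeddings are combined, so the weighted average used here, the function name init_target_embeddings, the weights, and the toy vocabulary are all hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def init_target_embeddings(
    source_embeddings: Dict[str, Vector],
    token_mapping: Dict[str, List[Tuple[str, float]]],
    fallback: Vector,
) -> Dict[str, Vector]:
    """Initialize each target token from the source tokens it maps to.

    ASSUMPTION: mapped source vectors are combined by a weighted average.
    The source text only states that the mapping is one-to-many.
    """
    dim = len(fallback)
    target_embeddings: Dict[str, Vector] = {}
    for target_token, candidates in token_mapping.items():
        # Keep only candidates present in the source vocabulary.
        known = [(s, w) for s, w in candidates if s in source_embeddings and w > 0]
        total = sum(w for _, w in known)
        if total == 0:
            # No usable mapping: use the fallback vector (hypothetical choice).
            target_embeddings[target_token] = list(fallback)
            continue
        vec = [0.0] * dim
        for source_token, weight in known:
            for i, value in enumerate(source_embeddings[source_token]):
                vec[i] += (weight / total) * value
        target_embeddings[target_token] = vec
    return target_embeddings


# Toy example with made-up 2-dimensional English (source) embeddings.
source = {"house": [1.0, 0.0], "home": [0.0, 1.0], "cat": [1.0, 1.0]}

# Hypothetical Dutch-to-English mapping with illustrative weights.
mapping = {
    "huis": [("house", 3.0), ("home", 1.0)],
    "kat": [("cat", 1.0)],
    "##xyz": [],  # a target token with no mapped source token
}

result = init_target_embeddings(source, mapping, fallback=[0.0, 0.0])
```

In this toy setup, "huis" lands between "house" and "home", weighted toward "house", while the unmapped token receives the fallback vector. In practice the fallback might be the mean source embedding or a random vector; that choice is also an assumption here.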

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification