Efficiently Adapting Pretrained Language Models To New Languages

Zoltan Csaki; Pian Pawakapan; Urmish Thakker; Qiantong Xu

arXiv:2311.05741·cs.CL·December 18, 2023·1 cites

TL;DR

This paper studies how to adapt a pretrained language model to a new language efficiently: new target-language tokens are added to the tokenizer to improve its encoding efficiency, and the training data mix is chosen to limit forgetting. The adapted models perform better on the target language with minimal regression on the original language.

Contribution

It shows how to adapt an existing pretrained LLM to a new language while avoiding the two problems of naive adaptation: catastrophic forgetting and poor tokenizer efficiency.

Findings

01. Improved performance on Hungarian and Thai with minimal English regression.

02. Enhanced tokenizer encoding efficiency through new token addition.

03. Effective data mixing strategies mitigate forgetting during adaptation.
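The data mixing idea in the findings above can be illustrated with a minimal sketch. It assumes that "mixing" means drawing each training example from either the target-language corpus or the original-language (English) corpus with a fixed probability. The function name `mix_examples`, the parameter `target_fraction`, and the value 0.75 are hypothetical and are not taken from the paper, which may use a different recipe.

```python
import random


def mix_examples(target_docs, english_docs, target_fraction, n_samples, seed=0):
    """Draw n_samples training examples from two corpora.

    Each example comes from target_docs with probability target_fraction,
    otherwise from english_docs. The ratio is an illustrative knob, not a
    value reported in the paper.
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = []
    for _ in range(n_samples):
        if rng.random() < target_fraction:
            mixed.append(("target", rng.choice(target_docs)))
        else:
            mixed.append(("english", rng.choice(english_docs)))
    return mixed


if __name__ == "__main__":
    hungarian = ["Jó reggelt!", "Köszönöm szépen."]
    english = ["Good morning!", "Thank you very much."]
    for source, text in mix_examples(hungarian, english, 0.75, 8):
        print(source, text)
```

Keeping some original-language data in the stream is one common way to reduce forgetting; the right fraction would have to be found empirically.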

Abstract

Recent large language models (LLMs) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train models for low-resource languages, especially from scratch, due to a lack of high-quality training data. Adapting pretrained LLMs reduces the need for data in the new language while also providing cross-lingual transfer capabilities. However, naively adapting to new languages leads to catastrophic forgetting and poor tokenizer efficiency. In this work, we study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues. In particular, we improve the encoding efficiency of the tokenizer by adding new tokens from the target language and study the data mixing recipe to mitigate forgetting. Our experiments on…
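To make "encoding efficiency" concrete, the sketch below measures the average number of tokens per word before and after adding target-language tokens to a vocabulary. It is a toy under stated assumptions: `greedy_tokenize` is a simple longest-match tokenizer with a single-character fallback, not the tokenizer used in the paper, and both vocabularies are invented for illustration.

```python
def greedy_tokenize(word, vocab):
    """Split a word by greedy longest match against vocab.

    Any character not covered by a vocabulary entry becomes its own token,
    which mimics how an ill-fitting tokenizer fragments unfamiliar text.
    """
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


def tokens_per_word(text, vocab):
    """Average token count per whitespace-separated word (lower is better)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


if __name__ == "__main__":
    # Hypothetical vocabularies: an English-centric base, then the same
    # vocabulary extended with a few Hungarian subwords.
    base_vocab = {"the", "ing", "tion"}
    extended_vocab = base_vocab | {"köszön", "öm", "szép", "en"}
    text = "köszönöm szépen"
    print("base:", tokens_per_word(text, base_vocab))
    print("extended:", tokens_per_word(text, extended_vocab))
```

In a real model the new tokens would also need embedding rows, and how those are initialized is a separate design choice that this sketch does not cover.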

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications