Efficiently Adapting Pretrained Language Models To New Languages
Zoltan Csaki, Pian Pawakapan, Urmish Thakker, Qiantong Xu

TL;DR
This paper proposes a method to efficiently adapt pretrained language models to new languages by enhancing tokenizer encoding and data mixing, achieving better performance on low-resource languages with minimal impact on the original language.
Contribution
It introduces a novel approach to adapt existing LLMs to new languages effectively, addressing catastrophic forgetting and tokenizer inefficiency.
Findings
Improved performance on Hungarian and Thai with minimal English regression.
Enhanced tokenizer encoding efficiency through new token addition.
Effective data mixing strategies mitigate forgetting during adaptation.
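The two ideas in these findings can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy longest-match tokenizer, the toy vocabularies, and the function names `fertility` and `mix_documents` are all assumptions made for illustration, and the `english_fraction` value is a free parameter rather than a ratio reported by the paper. The first part shows how adding target-language tokens can lower the number of tokens per word; the second shows one simple way to interleave original-language and target-language documents during continued training.

```python
import random


def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer (hypothetical, not the paper's tokenizer).

    Pieces not covered by the vocabulary fall back to single characters.
    """
    max_len = max(len(tok) for tok in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word (lower is better)."""
    words = text.split()
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


def mix_documents(target_docs, english_docs, english_fraction, n, seed=0):
    """Sample n documents, drawing from English with probability english_fraction.

    english_fraction is an assumed knob; the source text does not state a value.
    """
    rng = random.Random(seed)
    mixed = []
    for _ in range(n):
        pool = english_docs if rng.random() < english_fraction else target_docs
        mixed.append(rng.choice(pool))
    return mixed


# Toy vocabularies: the extended one adds a few Hungarian pieces (illustrative only).
base_vocab = {"a", "e", "sz"}
extended_vocab = base_vocab | {"szeretem", "nyelv", "eket"}
sample = "szeretem a nyelveket"

print(fertility(sample, base_vocab))      # tokens per word before adding tokens
print(fertility(sample, extended_vocab))  # tokens per word after adding tokens
print(mix_documents(["hu_doc"], ["en_doc"], english_fraction=0.25, n=8))
```

In a real setup the new tokens would also need embedding rows in the model, and the mixing ratio would be tuned empirically; both steps are outside the scope of this sketch.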
Abstract
Recent large language models (LLMs) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train models for low-resource languages, especially from scratch, due to a lack of high-quality training data. Adapting pretrained LLMs reduces the need for data in the new language while also providing cross-lingual transfer capabilities. However, naively adapting to new languages leads to catastrophic forgetting and poor tokenizer efficiency. In this work, we study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues. In particular, we improve the encoding efficiency of the tokenizer by adding new tokens from the target language and study the data mixing recipe to mitigate forgetting. Our experiments on…
Code & Models
- 🤗 sambanovasystems/SambaLingo-Arabic-Base (model · 24 downloads · 37 likes)
- 🤗 sambanovasystems/SambaLingo-Bulgarian-Base (model · 212 downloads · 27 likes)
- 🤗 sambanovasystems/SambaLingo-Hungarian-Base (model · 14 downloads · 30 likes)
- 🤗 sambanovasystems/SambaLingo-Japanese-Base (model · 11 downloads · 25 likes)
- 🤗 sambanovasystems/SambaLingo-Russian-Base (model · 114 downloads · 32 likes)
- 🤗 sambanovasystems/SambaLingo-Slovenian-Base (model · 10 downloads · 27 likes)
- 🤗 sambanovasystems/SambaLingo-Serbian-Base (model · 21 downloads · 28 likes)
- 🤗 sambanovasystems/SambaLingo-Thai-Base (model · 223 downloads · 31 likes)
- 🤗 sambanovasystems/SambaLingo-Turkish-Base (model · 26 downloads · 39 likes)
- 🤗 sambanovasystems/SambaLingo-Arabic-Base-70B (model · 15 downloads · 1 like)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
