Bilingual Adaptation of Monolingual Foundation Models
Gurpreet Gosal, Yishi Xu, Gokul Ramakrishnan, Rituraj Joshi, Avraham Sheinin, Zhiming (Charles) Chen, Biswajit Mishra, Natalia Vassilieva, Joel Hestness, Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Onkar Pandit, Satheesh Katipomu, Samta Kamboj, Samujjwal Ghosh, Rahul Pal

TL;DR
This paper introduces a two-stage method for adapting monolingual LLMs to new languages that balances retention of the original language with acquisition of the new one, demonstrated on Llama models for Arabic and Hindi.
Contribution
The study proposes a novel two-stage adaptation process involving vocabulary expansion and continual pre-training, improving cross-lingual transfer for large language models.
Findings
Significant improvement in Arabic language capabilities
Slight enhancement in English proficiency
Effective adaptation demonstrated on multiple Llama models
Abstract
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix, followed by full model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach results in significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates and release a detailed training recipe. To demonstrate generalizability of this approach we also…
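The two-stage recipe described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mean-of-existing-rows initialization is only one of several possible embedding initializations (the abstract says initialization techniques were ablated but does not name the chosen one), and the function names, parameter names, and the `arabic_fraction` knob are hypothetical.

```python
import random


def expand_embeddings(embeddings, num_new_tokens):
    """Append rows for newly added vocabulary tokens.

    Assumption: each new row is initialized to the mean of the existing
    rows. The abstract does not state which initialization was used.
    """
    dim = len(embeddings[0])
    mean_row = [
        sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)
    ]
    return embeddings + [list(mean_row) for _ in range(num_new_tokens)]


def trainable_parameters(param_names, stage):
    """Select which parameters receive gradient updates in each stage.

    Stage 1: only the embedding parameters (hypothetical naming: any
    parameter whose name contains "embed").
    Stage 2: every parameter (full-model continual pre-training).
    """
    if stage == 1:
        return [name for name in param_names if "embed" in name]
    if stage == 2:
        return list(param_names)
    raise ValueError("stage must be 1 or 2")


def sample_bilingual_batch(arabic_docs, english_docs, arabic_fraction, batch_size, rng):
    """Draw a batch where each document is Arabic with probability arabic_fraction."""
    batch = []
    for _ in range(batch_size):
        source = arabic_docs if rng.random() < arabic_fraction else english_docs
        batch.append(rng.choice(source))
    return batch


# Hypothetical usage: 2 original tokens, 2 new Arabic tokens.
expanded = expand_embeddings([[1.0, 3.0], [3.0, 5.0]], num_new_tokens=2)
stage1 = trainable_parameters(
    ["embed_tokens.weight", "layers.0.attn.weight"], stage=1
)
batch = sample_bilingual_batch(["ar_doc"], ["en_doc"], 0.5, 8, random.Random(0))
```

Keeping English documents in the stage 2 mix is what the abstract credits for retaining English proficiency; the actual mix ratio is one of the quantities the paper ablates and is not given here.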
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus · LLaMA
