Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation
Samin Mahdizadeh Sani, Pouya Sadeghi, Thuy-Trang Vu, Yadollah Yaghoobzadeh, Gholamreza Haffari

TL;DR
This paper investigates extending Llama, a large language model, to support Persian through parameter-efficient fine-tuning, multi-stage training, and bilingual data alignment, demonstrating improved Persian task performance with minimal impact on English tasks.
Contribution
The paper introduces a multi-stage approach for adding Persian to Llama that combines monolingual pretraining, bilingual alignment, and instruction-tuning, and it highlights the effectiveness of bilingual data in low-resource language adaptation.
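The staged recipe described above can be pictured as a simple sequential loop with an evaluation after each stage. The following is a minimal sketch, not the authors' code: the `Stage` class, the `run_pipeline` function, and the `train_fn` / `eval_fn` callbacks are hypothetical names introduced here for illustration, and the data descriptions are paraphrased from the summary.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, for illustration only)."""
    name: str
    data: str  # free-text description of the data used in this stage


# Stage order as described in the summary; names are illustrative.
STAGES: List[Stage] = [
    Stage("monolingual_pretraining", "monolingual Persian text"),
    Stage("bilingual_alignment", "bilingual pretraining and instruction data"),
    Stage("instruction_tuning", "task-specific instruction datasets"),
]


def run_pipeline(
    state: Any,
    stages: List[Stage],
    train_fn: Callable[[Any, Stage], Any],
    eval_fn: Callable[[Any], Any],
) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Apply each stage in order and record an evaluation after every stage."""
    history = []
    for stage in stages:
        state = train_fn(state, stage)                  # assumed: returns updated model state
        history.append((stage.name, eval_fn(state)))    # per-stage evaluation
    return state, history
```

In practice `train_fn` would wrap an actual fine-tuning run and `eval_fn` would score the model on the generation and classification benchmarks; here they are left as caller-supplied placeholders so that the stage ordering and per-stage evaluation stay explicit.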
Findings
Incorporating Persian improves classification accuracy for Persian tasks.
Bilingual data alignment enhances low-resource language performance.
Cross-lingual transfer offers limited benefit for Persian when training data is minimal.
Abstract
Large language models (LLMs) have made great progress in classification and text generation tasks. However, they are mainly trained on English data and often struggle with low-resource languages. In this study, we explore adding a new language, i.e., Persian, to Llama (a model with a limited understanding of Persian) using parameter-efficient fine-tuning. We employ a multi-stage approach involving pretraining on monolingual Persian data, aligning representations through bilingual pretraining and instruction datasets, and instruction-tuning with task-specific datasets. We evaluate the model's performance at each stage on generation and classification tasks. Our findings suggest that incorporating the Persian language, through bilingual data alignment, can enhance classification accuracy for Persian tasks, with no adverse impact and sometimes even improvements on English tasks.…
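To give a sense of why parameter-efficient fine-tuning is attractive here, the sketch below compares the number of trainable weights in a full update of one linear layer against a low-rank adapter of the LoRA kind. This is an assumption for illustration: the text above does not name the specific method used, and the layer size and rank are hypothetical example values, not figures from the paper.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def low_rank_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA-style update W + B @ A,
    with A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Illustrative values only (assumed, not taken from the paper).
d_in, d_out, rank = 4096, 4096, 8

full = full_update_params(d_in, d_out)                # 16,777,216
adapter = low_rank_adapter_params(d_in, d_out, rank)  # 65,536
fraction = adapter / full                             # 0.00390625, about 0.39%

print(f"full: {full:,}  adapter: {adapter:,}  fraction: {fraction:.4%}")
```

Under these assumed sizes the adapter trains well under one percent of the layer's weights while the original matrix stays frozen, which is one plausible reason such methods can add a language with limited disruption to existing English behaviour.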
Taxonomy
Topics: Translation Studies and Practices · Natural Language Processing Techniques
Methods: LLaMA
