LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language
Cagri Toraman

TL;DR
This paper investigates how to adapt large language models, trained mainly on English, to low-resource languages. It compares continual training, instruction and task-specific fine-tuning, and vocabulary extension, and analyzes how effective each strategy is.
Contribution
It systematically evaluates adaptation strategies for large language models to improve low-resource language performance, highlighting the effectiveness of continual training and task-specific fine-tuning.
Findings
Continual training improves language comprehension, as reflected in perplexity scores.
Task-specific fine-tuning enhances downstream task performance.
Vocabulary extension shows no significant benefits.
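Since the first finding rests on perplexity, a minimal sketch of how that metric is commonly computed may help. This assumes perplexity is defined as the exponential of the mean negative log-likelihood per token, which is the standard definition. The paper's exact evaluation setup is not described on this page, and the function name and the probabilities below are hypothetical values chosen only for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities assigned to the same text by two models.
# These numbers are made up; they are not results from the paper.
base_model = [math.log(p) for p in (0.05, 0.10, 0.02)]
adapted_model = [math.log(p) for p in (0.20, 0.25, 0.10)]

print(f"base:    {perplexity(base_model):.2f}")
print(f"adapted: {perplexity(adapted_model):.2f}")
```

A lower value means the model assigns higher probability to the text, which is why a drop in perplexity after continual training is read as better modeling of the target language.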
Abstract
Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
