LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language

Cagri Toraman

arXiv:2405.07745 · cs.CL · May 14, 2024 · 1 citation

TL;DR

This paper investigates methods to adapt large language models, mainly trained on English, for low-resource languages, comparing strategies like continual training, fine-tuning, and vocabulary extension, and analyzing their effectiveness.

Contribution

It systematically evaluates adaptation strategies for large language models to improve low-resource language performance, highlighting the effectiveness of continual training and task-specific fine-tuning.

Findings

01

Continual training improves language comprehension as shown by perplexity scores.

02

Task-specific fine-tuning enhances downstream task performance.

03

Vocabulary extension shows no significant benefits.
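
The first finding relies on perplexity as its measure of language comprehension. As a minimal sketch of how that metric is conventionally computed (the exponential of the average negative log-likelihood per token), the snippet below may help. The function name `perplexity` and the per-token probabilities are hypothetical and chosen for illustration only; they are not taken from the paper or its repository.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities for the same sentence under two models.
# These numbers are illustrative assumptions, not results from the paper.
base_model = [math.log(p) for p in (0.05, 0.10, 0.02, 0.08)]
adapted_model = [math.log(p) for p in (0.20, 0.25, 0.10, 0.30)]

# A model that assigns higher probability to the text gets lower perplexity.
print(perplexity(base_model) > perplexity(adapted_model))
```

Lower perplexity means the model assigns higher probability to held-out text, which is the sense in which the summary treats it as a proxy for comprehension.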

Abstract

Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary…

Code & Models

Repositories

metunlp/llamaturk
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis