Efficient Continual Pre-training of LLMs for Low-resource Languages

Arijit Nag; Soumen Chakrabarti; Animesh Mukherjee; Niloy Ganguly

arXiv:2412.10244·cs.CL·December 16, 2024

TL;DR

This paper introduces efficient methods for continual pre-training of open-source large language models to improve performance on low-resource languages while significantly reducing data and computational costs.

Contribution

The paper proposes two algorithms, one that selects a subset of training texts from a larger corpus and one that selects tokens for vocabulary augmentation, to improve low-resource language modeling at reduced cost.

Findings

01. Effective text subset selection reduces CPT data needs.
02. Vocabulary augmentation improves low-resource language performance.
03. Experiments demonstrate significant gains on Indian languages.

Abstract

Open-source large language models (Os-LLMs) propel the democratization of natural language research by giving the flexibility to augment or update model parameters for performance improvement. Nevertheless, like proprietary LLMs, Os-LLMs offer poorer performance on low-resource languages (LRLs) than on high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is costly in terms of both data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus. We show the effectiveness of our technique using very little CPT data. In search of further improvement, we design a new algorithm to select tokens to include in the LLM…
