Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus

Raviraj Joshi; Kanishk Singla; Anusha Kamath; Raunak Kalani; Rakesh Paul; Utkarsh Vaidya; Sanjay Singh Chauhan; Niranjan Wartikar; Eileen Long

arXiv:2410.14815 · cs.CL · April 22, 2025

TL;DR

This paper shows that continued pre-training on a mix of real and translation-based synthetic data improves multilingual LLMs on low-resource languages, demonstrated on Hindi, while keeping English performance competitive.

Contribution

The paper introduces Nemotron-Mini-Hindi 4B, a bilingual Hindi-English model built by continued pre-training of Nemotron-Mini 4B on real and synthetic tokens. The model achieves state-of-the-art results on Hindi benchmarks and shows improved factual accuracy.

Findings

01 State-of-the-art Hindi benchmark performance

02 Improved factual accuracy in low-resource languages

03 Continued pre-training benefits beyond language alignment

Abstract

Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual small language model (SLM) supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continued pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall…
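
The abstract describes training on a mix of real and synthetic Hindi + English tokens under a fixed token budget. The sketch below illustrates one common way to realize such a mix: sampling the source of each training sequence according to fixed weights. It is not the authors' pipeline. The source names, the weights in `ASSUMED_WEIGHTS`, the sequence length, and the helper `sample_schedule` are all hypothetical, since the abstract does not state the actual proportions.

```python
import random
from collections import Counter

# Hypothetical sampling weights; the abstract does not give the real mix.
ASSUMED_WEIGHTS = {
    "hindi_real": 0.3,
    "hindi_synthetic": 0.3,  # e.g. text machine-translated into Hindi
    "english_real": 0.4,
}


def sample_schedule(weights, token_budget, seq_len, seed=0):
    """Return the data source to draw each training sequence from.

    The number of sequences is token_budget // seq_len, and each source
    is picked with probability proportional to its weight.
    """
    if token_budget < 0 or seq_len <= 0:
        raise ValueError("token_budget must be >= 0 and seq_len > 0")
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    names = list(weights)
    n_sequences = token_budget // seq_len
    return rng.choices(names, weights=[weights[n] for n in names], k=n_sequences)


# Toy budget: 4,096,000 tokens at 4,096 tokens per sequence.
schedule = sample_schedule(ASSUMED_WEIGHTS, token_budget=4_096_000, seq_len=4096)
counts = Counter(schedule)
for name in ASSUMED_WEIGHTS:
    print(name, counts[name] / len(schedule))
```

Scaling `token_budget` up to the 400B tokens mentioned in the abstract changes only the number of sampled sequences; the observed shares approach the configured weights as the schedule grows.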

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Balanced Selection