Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus
Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh, Paul, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, Eileen Long

TL;DR
This paper demonstrates that continued pre-training with synthetic and real data significantly improves multilingual LLMs' performance on low-resource languages, exemplified by Hindi, while maintaining English capabilities.
Contribution
It introduces a bilingual model trained with synthetic data and continued pre-training, achieving state-of-the-art results in Hindi and enhancing factual accuracy.
Findings
State-of-the-art Hindi benchmark performance
Improved factual accuracy in low-resource languages
Continued pre-training benefits beyond language alignment
Abstract
Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual SLM supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continuous pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
MethodsBalanced Selection
