Training Bilingual LMs with Data Constraints in the Targeted Language

Skyler Seto; Maartje ter Hoeve; Richard He Bai; Natalie Schluter; David Grangier

arXiv:2411.12986·cs.CL·February 7, 2025

Open Access

TL;DR

This paper investigates how to improve bilingual language models for low-resource languages by leveraging high-quality data from auxiliary languages, using translation and data upsampling techniques to enhance performance.

Contribution

It introduces methods for utilizing auxiliary language data and upsampling to boost performance in low-resource target languages without changing model architecture.

Findings

1. Auxiliary language data improves target language performance.
2. Translation systems facilitate cross-lingual transfer.
3. Data-rich English datasets benefit low-resource languages.

Abstract

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high-quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language that lacks sufficient pretraining data for training a high-performing language model, by enlisting data from an auxiliary language for which high-quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language and training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target languages, and proposing new methods for upsampling data from the auxiliary language.…

Taxonomy

Topics: Natural Language Processing Techniques · Interpreting and Communication in Healthcare · Text Readability and Simplification