Multilingual Language Model Pretraining using Machine-translated Data

Jiayi Wang; Yao Lu; Maurice Weber; Max Ryabinin; David Adelani; Yihong Chen; Raphael Tang; Pontus Stenetorp

arXiv:2502.13252·cs.CL·February 20, 2025

TL;DR

This paper demonstrates that machine-translated high-quality English data can significantly improve multilingual language models, achieving state-of-the-art results with less data than existing models.

Contribution

It introduces TransWebEdu, a large multilingual dataset created via machine translation, and trains TransWebLLM, a 1.3B-parameter model that matches or outperforms larger models on non-English tasks.

Findings

01

TransWebLLM matches or outperforms larger models like Llama3.2 on nine non-English tasks.

02

Adding less than 5% of TransWebEdu to the pretraining data sets a new state of the art in several languages.

03

Machine-translated data from a single source can enhance multilingual model performance.

Abstract

High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated texts from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2, Qwen2.5,…

Code & Models

Datasets

britllm/TransWebEdu
dataset · 1.5k dl

Videos

Multilingual Language Model Pretraining using Machine-translated Data · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling