Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

Jiayi Wang; Yao Lu; Maurice Weber; Max Ryabinin; Yihong Chen; Raphael Tang; Pontus Stenetorp

arXiv:2410.23956 · cs.CL · November 7, 2024

TL;DR

This paper demonstrates that machine-translated web data from a single high-quality source language can effectively pretrain multilingual LLMs, achieving competitive performance with significantly less data than existing models.

Contribution

The paper introduces a new multilingual pretraining dataset, created by machine-translating English web text into French, German, and Spanish, and shows that models trained on it perform on par with or better than larger models trained on more data.

Findings

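01. Across five non-English reasoning tasks, CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2.

02. CuatroLLM achieves these comparable results using less than 6% of the training data used by those models.

03. Additional domain-specific pretraining improves multilingual reasoning performance.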

Abstract

English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as leading LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish, resulting in a final 300B-token dataset, which we call TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on this dataset. Across five non-English reasoning tasks, we show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2…
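The corpus construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `build_multilingual_corpus`, `fake_translate`, and the record layout are hypothetical names introduced here, the machine translation system is left as a pluggable callable, and the sketch assumes the English originals are kept alongside the French, German, and Spanish translations.

```python
from typing import Callable, Dict, Iterable, Iterator

# Target languages named in the abstract; English is the source language.
TARGET_LANGS = ("fr", "de", "es")


def build_multilingual_corpus(
    english_docs: Iterable[str],
    translate: Callable[[str, str], str],
) -> Iterator[Dict[str, str]]:
    """Yield one record per (document, language) pair.

    `translate(text, target_lang)` is a hypothetical callable standing in
    for whatever machine translation system is actually used.
    """
    for doc_id, text in enumerate(english_docs):
        # Assumption: the English original is kept next to its translations.
        yield {"doc_id": str(doc_id), "lang": "en", "text": text}
        for lang in TARGET_LANGS:
            yield {
                "doc_id": str(doc_id),
                "lang": lang,
                "text": translate(text, lang),
            }


def fake_translate(text: str, target_lang: str) -> str:
    # Placeholder so the sketch runs without a real translation model.
    return f"[{target_lang}] {text}"


corpus = list(build_multilingual_corpus(["Water boils at 100 C."], fake_translate))
for record in corpus:
    print(record["lang"], record["text"])
```

Passing the translator in as a callable keeps the sketch independent of any particular translation model; a real system would replace `fake_translate` with batched calls to an actual model.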

Taxonomy

Topics: Natural Language Processing Techniques