Revisiting Multilingual Data Mixtures in Language Model Pretraining

Negar Foroutan; Paul Teiletche; Ayush Kumar Tarun; Antoine Bosselut

arXiv:2510.25947·cs.CL·October 31, 2025

3 Reviews

TL;DR

This study investigates how diverse multilingual data mixtures affect large language models, revealing that balanced multilingual training can improve performance without the expected trade-offs, challenging common beliefs about the curse of multilinguality.

Contribution

It provides empirical evidence that combining multilingual data with English need not harm in-language performance when each language has enough pretraining tokens, and it highlights the benefits of using English as a pivot language in multilingual pretraining.

Findings

01

Combining English and multilingual data does not necessarily degrade in-language performance, provided each language has enough tokens in the pretraining corpus.

02

Using English as a pivot language benefits multiple language families.

03

No significant curse of multilinguality is observed as the number of languages increases.

Abstract

The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper provides a “short survey” of several prior results on how to design multilingual data mixtures for training multilingual language models. The paper then provides evidence contrary to important beliefs propagated by these prior papers. The paper is very well written and organised, with clear assumptions and takeaways from each experiment.

Weaknesses

While this paper examines several prior assumptions about multilingual data mixtures, each assumption is not necessarily comprehensively examined. Further, the paper does not provide enough experiments to show *why* its results differ from prior works.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. Tackles an important yet relatively understudied topic of multilingual fairness in large language models.
2. The paper is well written. The narrative is clear, and the experiments are comprehensive and carefully designed.

Overall, this is a strong paper that will be of interest to the community. I recommend acceptance, assuming the minor concerns outlined below are addressed during the rebuttal.

Weaknesses

1. The pivot language experiments are limited to the Cyrillic and Slavic languages. It would strengthen the paper to include other language families to confirm the generality of the results, especially given that one of the paper’s main claimed strengths is its broad coverage of languages.
2. Table 1 could be made stronger by studying varying percentages of English instead of keeping it fixed at 40%. If the claim is that beyond a certain English data threshold, the number of additional languages

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The study investigates a number of different research questions that can be taken into consideration when pre-training large multilingual models.
2. The paper studies models at two different parameter sizes, which improves the generality of the findings.
3. The study investigates scaling with different numbers of languages, including a larger number of languages.
4. Ablation studies are generally well thought out, such as comparing fixed total and fixed multilingual budget or s

Weaknesses

As a whole, I have concerns regarding the relevance of the studied assumptions and as a result the novelty of the corresponding insights. Re #1: recent closed-source and open-source models have much improved English and multilingual performance, which indicates that more English data does not come at the cost of performance in other languages (up to a %), contrary to the assumption stated in the paper. Re #3: curriculum learning has not been used in the pre-training of a state-of-the-art LLM a
