ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi

TL;DR
This paper presents ATLAS, a scaling law for multilingual models that clarifies how to scale models and transfer knowledge across languages, based on 774 multilingual training experiments spanning 10M-8B model parameters and 400+ training languages.
Contribution
The paper introduces the Adaptive Transfer Scaling Law (ATLAS), a scaling law for multilingual pretraining and finetuning whose out-of-sample generalization exceeds that of existing scaling laws, often by more than 0.3 R^2.
Findings
ATLAS outperforms existing scaling laws in out-of-sample generalization, often by more than 0.3 R^2.
A cross-lingual transfer matrix quantifies mutual benefit scores for 38 x 38 = 1444 language pairs.
Scaling laws guide optimal model size and data addition.
Abstract
Scaling laws research has focused overwhelmingly on English -- yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which outperforms existing scaling laws' out-of-sample generalization often by more than 0.3 R^2. Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual benefit scores between 38 x 38 = 1444 language pairs. Second, we derive a language-agnostic scaling law…
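To make the "out-of-sample R^2" comparison in the abstract concrete, here is a minimal sketch of how a scaling law can be scored on held-out runs. It is not the ATLAS law, whose equations are not reproduced on this page. It assumes a Chinchilla-style form L(N, D) = E + A/N^alpha + B/D^beta, and the function names, coefficients, and held-out data points are all hypothetical placeholders chosen for illustration.

```python
# Illustrative only: a Chinchilla-style loss, NOT the ATLAS functional form.
# All coefficients and data points below are hypothetical placeholders.

def chinchilla_loss(n_params, n_tokens, E=1.7, A=400.0, B=410.0,
                    alpha=0.34, beta=0.28):
    """Predicted loss for a model with n_params parameters and n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out runs: (parameters, training tokens, observed loss).
held_out = [
    (1e8, 2e9, 3.40),
    (1e9, 2e10, 2.65),
    (8e9, 1.6e11, 2.20),
]

observed = [loss for _, _, loss in held_out]
predicted = [chinchilla_loss(n, d) for n, d, _ in held_out]
print(f"held-out R^2 = {r_squared(observed, predicted):.3f}")
```

Under this reading, a gain of "more than 0.3 R^2" would mean one law's held-out `r_squared` exceeds another's by at least 0.3 on the same evaluation runs. The paper's exact evaluation protocol may differ.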
Peer Reviews
Decision: ICLR 2026 Poster
- The problems the paper wants to tackle are important in the multilingual learning literature. Each section begins with a clear research question, which guides the reader through the narrative logically.
- The work shows significant experimental efforts. Notably, the bilingual transfer table in Figure 2 is a valuable asset to the community of multilingual learning.
- The findings offer actionable insights for multilingual model practitioners, especially the compute-optimal scaling frontier a
- It is unclear which part of the proposed law is a unique contribution of the authors, and which is adapted. For instance, it is known that equation (1) essentially follows the Chinchilla scaling law, but why do equations (2) and (3) take the given specific form? Although the fitting accuracy is high, it would be helpful to provide some justification about why the proposed law organizes those terms in the optimal way. If those constructions are an improvement or combination of previous scaling
• A systematic and comprehensive study of scaling laws for multilingual models is an important topic.
• A significant number of experiments are conducted. A few important findings are drawn (though the evidence supporting the claims requires additional attention).
• Definitions are missing for a few key concepts and equations. For example, the symbols in equation (1) are not defined. The grounding of these symbols and equations is not available in this paper. Readers must look up the citations and guess at the meaning, which risks an inaccurate understanding of the key idea.
• The formal model of scaling laws is not well defined in this paper.
• The writing and structure of this paper doesn't meet the scientific paper qualit
1. The paper presents a new functional form for modeling multilingual setups, accounting for repeated tokens as well as data mixtures. The consequent scaling law has better predictive power compared to other baselines.
2. For me, the biggest contribution is the cross-lingual study for understanding transfer at scale: the significant number of pairwise experiments definitely helps identify key factors for cross-lingual transfer, and can additionally be an important resource for trying to further unde
While I think the paper makes some really good contributions, I do have the following concerns:
1. The scaling law's functional form doesn't explain why it was chosen to be the functional form in the first place, compared to other ways of formulating the interaction. Concretely, [1,2] both demonstrate that L(N, D, p) ~ L(N, D)*p exp(\gamma), which also requires that the term modeling the parameters should also depend on the proportion of language (a similar symmetrization argument was also made
