Scaling Laws for Multilingual Language Models

Yifei He; Alon Benhaim; Barun Patra; Praneetha Vaddamanu; Sanchit Ahuja; Parul Chopra; Vishrav Chaudhary; Han Zhao; Xia Song

arXiv:2410.12883·cs.CL·December 5, 2024


Open Access · 3 Reviews

TL;DR

This paper introduces a new scaling law for multilingual language models that predicts performance based on dataset size, model size, and sampling ratios, simplifying analysis and optimizing training across many languages.

Contribution

The paper presents a novel scaling law for multilingual LMs that relates performance to dataset size, model size, and sampling ratios, validated through extensive experiments.

Findings

01. Performance can be predicted by a power-law relationship.

02. Optimal sampling ratios generalize across model sizes.

03. Large models benefit from the proposed scaling law for efficient training.
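As a rough illustration of the first finding, a power law can be fitted with a straight-line regression in log-log space. The sketch below assumes that, at a fixed model size and dataset size, a family's loss behaves like `c * p**(-gamma)` in its sampling ratio `p`. That functional form, the function name `fit_power_law`, and the data points are all assumptions made for illustration; none of them are taken from the paper.

```python
import math

def fit_power_law(ps, losses):
    """Fit loss ~ c * p**(-gamma) by ordinary least squares on (log p, log loss)."""
    xs = [math.log(p) for p in ps]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # equals -gamma under the assumed form
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (c, gamma)

# Synthetic, noise-free data generated from the assumed form (not real results).
ps = [0.1, 0.2, 0.4, 0.8]
losses = [3.0 * p ** -0.1 for p in ps]
c, gamma = fit_power_law(ps, losses)
```

On real measurements the points would be noisy, so the fitted `c` and `gamma` would only approximate the underlying trend.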

Abstract

We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, tackling the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To address this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of other languages in the mixture. This insight simplifies the complexity of multilingual scaling and makes the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size and sampling ratios. This…
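The abstract is cut off before the law itself is stated, so the following is only a sketch of how such a relationship could be used to choose sampling ratios. It assumes a Chinchilla-style loss term scaled by a power of the family's own sampling ratio, and it assumes the goal is to minimize a weighted sum of per-family losses subject to the ratios summing to 1. The symbols `E`, `A`, `alpha`, `B`, `beta`, `gamma`, the function names, and all numeric values are hypothetical.

```python
import math

def family_loss(N, D, p, E, A, alpha, B, beta, gamma):
    """Assumed form: (E + A / N**alpha + B / D**beta) * p**(-gamma)."""
    return (E + A / N ** alpha + B / D ** beta) * p ** (-gamma)

def optimal_ratios(cs, gammas, weights=None, iters=200):
    """Minimize sum_i w_i * c_i * p_i**(-gamma_i) subject to sum_i p_i = 1.

    Each c_i is the family's loss coefficient at fixed N and D. The
    stationarity condition gives p_i = (w_i*c_i*gamma_i / lam)**(1/(gamma_i+1));
    the multiplier lam is found by bisection on log(lam).
    """
    weights = weights or [1.0] * len(cs)

    def ratios(log_lam):
        return [
            math.exp((math.log(w * c * g) - log_lam) / (g + 1.0))
            for w, c, g in zip(weights, cs, gammas)
        ]

    lo, hi = -50.0, 50.0  # bracket for log(lam); the ratio sum decreases in lam
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(ratios(mid)) > 1.0:
            lo = mid      # ratios too large, so lam must grow
        else:
            hi = mid
    return ratios((lo + hi) / 2.0)

def total_loss(ps, cs, gammas):
    return sum(c * p ** (-g) for p, c, g in zip(ps, cs, gammas))

# Made-up coefficients for two hypothetical language families.
cs, gammas = [2.0, 3.0], [0.05, 0.2]
best = optimal_ratios(cs, gammas)
symmetric = optimal_ratios([2.0, 2.0], [0.1, 0.1])
```

Because each term is convex in its own ratio when `gamma` is positive, the stationary point found this way is the minimizer of the assumed objective. A different objective or a different functional form would need a different solver.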

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 2

Strengths

An overlooked area in the field. Multilingual Language models are very important. Lots of experiments.

Weaknesses

A bit dense (especially some captions of figures and tables, which took me a while to figure out). For instance, my thoughts about "Figure 1 Model size N figure is a bit confusing to me. What is N? 0, 2, 4, 8 of what? Oh, 8 is 85 million parameters. It could be a bit clearer in the caption." Figure 5 left and middle … a bit hard to know. Y-axis is loss (for a sampling ratio?) X-axis is sampling ratio? Colored lines are model sizes. Data size is just the difference between left and middle graph? The captio

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The proposed scaling law is a better fit for multilingual pre-training because it considers cross-lingual transfer.
- There is an interesting finding that $\gamma_{i}(N, D)$ is independent of N and D.
- The inferred optimal sampling ratios bring a marginal improvement (< 0.1) over the three naive baselines.

Weaknesses

- **Missing important baselines**: The three baselines in this paper are naive. More baseline methods like the Equation (7) in Fernandes et al., 2023 are not incorporated.
- **More experiments needed**: The minimal cross-group transfer hypothesis (hypothesis 1) is hard to quantify and meet. It is still far from the real phenomenon. To support hypothesis 1, experiments are only conducted on (Romance, Indic) and (Sino-Tibetan, Slavic) language family pairs with varying three sampling ratios in {0

Reviewer 03 · Rating 5 · Confidence 3

Strengths

S1. The proposed scaling law could simplify the complexity of cross-lingual interactions during training, offering a meaningful extension of existing work that focuses primarily on monolingual or bilingual models. This decoupling from individual languages could also reduce the computational burden.

S2. The paper also proposed an optimal sampling ratio as a practical data-mixing strategy, which could be applicable for real-world multilingual model training.

S3. The paper is clear in its present

Weaknesses

W1. The primary issue with the paper is its focus on the generality of the proposed scaling law without applying it to specific downstream tasks (beyond machine translation as it is claimed). Testing it on downstream tasks like summarization or question answering, on metrics that are widely used to assess the performance of each downstream task, would have strengthened the claim that this scaling law is for a general-purpose decoder multilingual language model. It also provides more solid eviden

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus