Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Jiasheng Ye; Peiju Liu; Tianxiang Sun; Jun Zhan; Yunhua Zhou; Xipeng Qiu

arXiv:2403.16952·cs.CL·March 21, 2025·2 cites

TL;DR

This paper introduces data mixing laws that predict language model performance based on data mixture proportions, enabling optimal data selection and efficient training without extensive experimentation.

Contribution

It proposes a quantitative framework for predicting model performance from data mixtures, guiding optimal data selection and training strategies for large language models.

Findings

01 Accurately predicts performance of unseen data mixtures

02 Optimizes the data mixture for a 1B model, matching the performance of the default mixture trained for 48% more steps

03 Extends to continual training, predicting the critical mixture proportion that avoids catastrophic forgetting

Abstract

Pretraining data of large language models comprises multiple domains (e.g., web texts, academic papers, code), whose mixture proportions crucially impact the competence of the resulting models. While existing endeavors rely on heuristics or qualitative strategies to tune the proportions, we discover the quantitative predictability of model performance regarding the mixture proportions in function forms, which we refer to as the data mixing laws. Fitting such functions on sample mixtures unveils model performance on unseen mixtures before actual runs, thus guiding the selection of an ideal data mixture. Furthermore, we propose nested use of the scaling laws of training steps, model sizes, and our data mixing law to enable predicting the performance of large models trained on massive data under various mixtures with only small-scale training. Moreover, experimental results verify that our…
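
To make the "fit on sample mixtures, predict unseen mixtures" idea concrete, here is a minimal sketch for two domains. It assumes, for illustration only, that the validation loss on one domain follows an exponential function of that domain's proportion `r`, i.e. `loss ≈ c + k * exp(t * r)`. The abstract above only says "function forms", so the functional form, the helper names `fit_mixing_law` and `predict`, and the synthetic loss values are all assumptions of this sketch, not the authors' released code or measured results.

```python
import numpy as np

def fit_mixing_law(r, loss, n_grid=2000):
    """Fit loss ~ c + k * exp(t * r) (an assumed form, see lead-in).

    Scans candidate values of the irreducible term c; for each fixed c,
    log(loss - c) = log(k) + t * r is linear in r and is solved by
    ordinary least squares. Assumes all losses are positive.
    """
    r = np.asarray(r, dtype=float)
    loss = np.asarray(loss, dtype=float)
    best = None
    for c in np.linspace(0.0, loss.min() - 1e-6, n_grid):
        t, log_k = np.polyfit(r, np.log(loss - c), 1)  # slope, intercept
        pred = c + np.exp(log_k + t * r)
        err = float(np.sum((pred - loss) ** 2))
        if best is None or err < best[0]:
            best = (err, float(c), float(np.exp(log_k)), float(t))
    _, c, k, t = best
    return c, k, t

def predict(params, r):
    """Evaluate the fitted law at mixture proportion(s) r."""
    c, k, t = params
    return c + k * np.exp(t * np.asarray(r, dtype=float))

# Synthetic "small-scale runs": made-up losses, NOT results from the paper.
r_train = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # proportion of domain A
loss_train = 1.5 + 1.0 * np.exp(-2.0 * r_train)  # hypothetical ground truth

params = fit_mixing_law(r_train, loss_train)
pred = float(predict(params, 0.6))  # a mixture not used for fitting
print("fitted (c, k, t):", params)
print("predicted loss at r=0.6:", pred)
```

With more than two domains, `r` would become a vector of proportions and the exponent a weighted sum over domains. The grid scan over `c` is chosen here only to keep the sketch dependency-free; a nonlinear least-squares solver would be the more usual choice.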

Code & Models

Repositories

yegcjs/mixinglaws (PyTorch, official)

Videos

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management