BiMix: A Bivariate Data Mixing Law for Language Model Pretraining
Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

TL;DR
This paper introduces BiMix, a new bivariate data mixing law that models domain proportions and data volume in language model pretraining, enabling better understanding and optimization of data mixtures for improved model performance.
Contribution
BiMix provides a systematic framework for modeling and optimizing data mixtures in LLM pretraining, with high accuracy and generalization, along with entropy-based proxies for efficient data mixing assessment.
Findings
BiMix extrapolates loss accurately, with mean relative error < 0.2%.
It generalizes well to unseen data mixtures with R² > 0.97.
Optimized domain proportions improve model performance over existing methods.
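For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how mean relative error and R² are commonly computed from observed and predicted losses. The function names and the sample numbers are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def mean_relative_error(observed, predicted):
    """Average of |predicted - observed| / |observed| over all points."""
    errors = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical validation losses, invented purely for illustration.
observed_loss = [3.20, 2.95, 2.80, 2.71]
predicted_loss = [3.21, 2.94, 2.80, 2.72]

print(f"MRE: {mean_relative_error(observed_loss, predicted_loss):.4%}")
print(f"R^2: {r_squared(observed_loss, predicted_loss):.4f}")
```

A mean relative error below 0.2% means the predicted losses deviate from the measured ones by less than 0.002 of their value on average; an R² above 0.97 means the predictions account for more than 97% of the variance in the observed losses.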
Abstract
Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces BiMix, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. BiMix provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate BiMix's high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures (R² > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
