RegMix: Data Mixture as Regression for Language Model Pre-training
Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin

TL;DR
RegMix introduces an automated regression-based approach to identify optimal data mixtures for large language model pre-training, significantly improving performance and efficiency over manual selection methods.
Contribution
It formulates data mixture selection as a regression task, enabling automatic identification of high-performing mixtures for large-scale language model training.
Findings
Data mixtures greatly influence model performance.
Web corpora correlate more strongly with downstream performance than sources perceived as high-quality, such as Wikipedia.
RegMix outperforms human selection and matches or surpasses DoReMi while using only 10% of the compute budget.
Abstract
The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix trains many small models on diverse data mixtures, uses regression to predict performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens to fit the regression model and predict the best data mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Furthermore, RegMix consistently outperforms human selection in experiments…
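The pipeline described in the abstract (train proxies on sampled mixtures, fit a regressor, pick the mixture with the best predicted score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four domains, the `toy_proxy_loss` function standing in for actual small-model training, and the choice of ordinary least squares as the regressor are all assumptions made for the example.

```python
import numpy as np

NUM_DOMAINS = 4      # hypothetical number of data domains
NUM_PROXIES = 512    # matches the number of small models in the abstract

rng = np.random.default_rng(0)

def sample_mixtures(n, k, rng):
    """Draw n mixtures over k domains; each row is non-negative and sums to 1."""
    return rng.dirichlet(np.ones(k), size=n)

def toy_proxy_loss(mixtures, rng):
    """Hypothetical stand-in for training a small proxy model per mixture.

    Assumes (for illustration only) that validation loss falls linearly
    with the share of each domain, plus a little noise.
    """
    benefit = np.array([0.9, 0.3, 0.1, 0.5])  # made-up per-domain effect
    noise = rng.normal(scale=0.01, size=len(mixtures))
    return 3.0 - mixtures @ benefit + noise

def fit_linear(X, y):
    """Least-squares fit without an intercept.

    Mixture weights sum to 1, so a constant offset is already
    expressible through the per-domain coefficients.
    """
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict(coef, X):
    return X @ coef

# 1. "Train" proxy models on diverse mixtures and record their losses.
mixtures = sample_mixtures(NUM_PROXIES, NUM_DOMAINS, rng)
losses = toy_proxy_loss(mixtures, rng)

# 2. Fit the regression from mixture weights to loss.
coef = fit_linear(mixtures, losses)

# 3. Score many unseen candidate mixtures and keep the best prediction.
candidates = sample_mixtures(100_000, NUM_DOMAINS, rng)
best = candidates[np.argmin(predict(coef, candidates))]

print("fitted coefficients:", np.round(coef, 3))
print("predicted-best mixture:", np.round(best, 3))
```

In a real run, step 1 would be replaced by actual proxy training, and the selected mixture would then be used to train the large model. A linear model always pushes the prediction toward a corner of the simplex, so a more flexible regressor may be preferable in practice.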
Peer Reviews
Decision·ICLR 2025 Spotlight
- The paper introduces a novel, regression-based approach for selecting data mixtures that reduces computational costs in language model training. The approach offers an efficient alternative to traditional dynamic or heuristic data allocation methods, making a valuable contribution to the field.
- The paper is technically robust and well-structured, with extensive validation across diverse data scenarios. It empirically supports the rank invariance hypothesis and uses clear, well-structured fi
- To maximize impact, the authors could highlight specific scenarios where the approach enables previously infeasible experiments due to resource constraints. Also, adding a broader discussion on trade-offs of the method (e.g., scenarios where the rank invariance assumption might not hold) would help readers assess its practical relevance and future applicability.
- The work could have used standardized computation metrics, such as FLOPs or GPU hours, to allow clearer comparison of the method e
1. RegMix introduces a fresh approach by framing data mixture selection as a regression problem rather than relying on complex optimizations or heuristics, making the process scalable and computationally efficient.
2. The paper’s experimental setup is robust, with 512 small proxy models across diverse data mixtures, creating a solid regression model for data selection.
3. The paper is well-organized, clearly explaining the methodology and experiments. It introduces the hypothesis of rank invar
The paper conducts a set of small-proxy models trained with small-scale tokens. The paper only experiments with 1M models with 1B tokens. It is unclear how to decide the size of the proxy model parameter and training token.
- The paper presents a novel method, REGMIX, which formulates the data mixture selection problem as a regression task. This is a creative approach that leverages small-scale proxy models to predict optimal data mixtures for large-scale models.
- The authors conducted extensive experiments, training 512 models with 1M parameters on 1B tokens to fit the regression model. They then validated this model by training a 1B parameter model on 25B tokens, showing superior performance compared to human se
- The key assumption of rank invariance of data mixtures across different model sizes and token counts is not thoroughly validated. This assumption might not hold in all cases, especially with significant changes in model scale and data distribution.
- The paper claims stability across different proxy model sizes, but the experiments are limited to models with up to 1B parameters. It remains unclear if the method would be equally effective for much larger models commonly used in practice (e.g.,
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
