RegMix: Data Mixture as Regression for Language Model Pre-training

Qian Liu; Xiaosen Zheng; Niklas Muennighoff; Guangtao Zeng; Longxu Dou; Tianyu Pang; Jing Jiang; Min Lin

arXiv:2407.01492·cs.CL·January 24, 2025·2 cites

Open Access · 1 Repo · 5 Models · 2 Datasets · 3 Reviews

TL;DR

RegMix introduces an automated regression-based approach to identify optimal data mixtures for large language model pre-training, significantly improving performance and efficiency over manual selection methods.

Contribution

It formulates data mixture selection as a regression task, enabling automatic identification of high-performing mixtures for large-scale language model training.

Findings

01

Data mixtures greatly influence model performance.

02

General web data correlates more strongly with downstream performance than sources usually perceived as high-quality.

03

Automatic mixture selection with RegMix outperforms human selection and matches or exceeds DoReMi.

Abstract

The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix trains many small models on diverse data mixtures, uses regression to predict performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens to fit the regression model and predict the best data mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Furthermore, RegMix consistently outperforms human selection in experiments…
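The three steps described in the abstract (train small proxies on varied mixtures, fit a regression, pick the best predicted mixture) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the number of domains, the `simulate_proxy_loss` function (a synthetic stand-in for actually training a small model), the `TRUE_WEIGHTS` vector, and the choice of plain least-squares linear regression are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_DOMAINS = 5        # hypothetical number of data domains
N_PROXY_RUNS = 512   # number of small proxy runs, as in the abstract

# Hypothetical ground truth, used only to simulate proxy results.
TRUE_WEIGHTS = np.array([-0.8, -0.3, 0.1, 0.4, 0.6])


def simulate_proxy_loss(mixture: np.ndarray) -> float:
    """Synthetic stand-in for training a small model on `mixture`
    and measuring its validation loss (lower is better)."""
    noise = rng.normal(scale=0.01)
    return float(3.0 + mixture @ TRUE_WEIGHTS + noise)


# Step 1: sample diverse mixtures (each row sums to 1) and "train" a proxy on each.
train_mixtures = rng.dirichlet(np.ones(N_DOMAINS), size=N_PROXY_RUNS)
train_losses = np.array([simulate_proxy_loss(m) for m in train_mixtures])

# Step 2: fit a linear regression from mixture weights to loss.
# No intercept column is needed: the weights sum to 1, so a constant
# offset is absorbed into the per-domain coefficients.
coef, *_ = np.linalg.lstsq(train_mixtures, train_losses, rcond=None)


def predict_loss(mixtures: np.ndarray) -> np.ndarray:
    """Predicted loss for each row of `mixtures`."""
    return mixtures @ coef


# Step 3: score many unseen candidate mixtures and keep the best prediction.
candidates = rng.dirichlet(np.ones(N_DOMAINS), size=100_000)
best_mixture = candidates[np.argmin(predict_loss(candidates))]

print("fitted coefficients:", np.round(coef, 3))
print("best predicted mixture:", np.round(best_mixture, 3))
```

In a real run, step 1 is the expensive part and the regression and candidate search are nearly free, which is why many candidate mixtures can be scored without further training.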

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper introduces a novel, regression-based approach for selecting data mixtures that reduces computational costs in language model training. The approach offers an efficient alternative to traditional dynamic or heuristic data allocation methods, making a valuable contribution to the field.
- The paper is technically robust and well-structured, with extensive validation across diverse data scenarios. It empirically supports the rank invariance hypothesis and uses clear, well-structured fi

Weaknesses

- To maximize impact, the authors could highlight specific scenarios where the approach enables previously infeasible experiments due to resource constraints. Also, adding a broader discussion on trade-offs of the method (e.g., scenarios where the rank invariance assumption might not hold) would help readers assess its practical relevance and future applicability.
- The work could have used standardized computation metrics, such as FLOPs or GPU hours, to allow clearer comparison of the method e
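As a rough illustration of the kind of standardized compute comparison the reviewer asks for, the sketch below applies the common rule of thumb that training costs about 6 FLOPs per parameter per token. The rule of thumb and the helper name `approx_train_flops` are assumptions introduced here, not figures from the paper; the approximation ignores embedding parameters and hardware efficiency, so it is especially crude for very small models.

```python
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: roughly 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# Sizes taken from the abstract: 512 proxies (1M params, 1B tokens each)
# versus one target model (1B params, 25B tokens).
proxy_total = 512 * approx_train_flops(1e6, 1e9)
target_single = approx_train_flops(1e9, 25e9)
ratio = proxy_total / target_single

print(f"all proxy runs : {proxy_total:.3e} FLOPs")
print(f"one target run : {target_single:.3e} FLOPs")
print(f"proxy / target : {ratio:.4f}")
```

Under this approximation, all 512 proxy runs together cost about 2% of a single 1B-parameter, 25B-token run.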

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. RegMix introduces a fresh approach by framing data mixture selection as a regression problem rather than relying on complex optimizations or heuristics, making the process scalable and computationally efficient.
2. The paper’s experimental setup is robust, with 512 small proxy models across diverse data mixtures, creating a solid regression model for data selection.
3. The paper is well-organized, clearly explaining the methodology and experiments. It introduces the hypothesis of rank invar

Weaknesses

The paper trains a set of small proxy models on a small number of tokens, experimenting only with 1M-parameter models and 1B tokens. It is unclear how to choose the proxy model size and the number of training tokens.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper presents a novel method, REGMIX, which formulates the data mixture selection problem as a regression task. This is a creative approach that leverages small-scale proxy models to predict optimal data mixtures for large-scale models.
- The authors conducted extensive experiments, training 512 models with 1M parameters on 1B tokens to fit the regression model. They then validated this model by training a 1B parameter model on 25B tokens, showing superior performance compared to human se

Weaknesses

- The key assumption of rank invariance of data mixtures across different model sizes and token counts is not thoroughly validated. This assumption might not hold in all cases, especially with significant changes in model scale and data distribution.
- The paper claims stability across different proxy model sizes, but the experiments are limited to models with up to 1B parameters. It remains unclear if the method would be equally effective for much larger models commonly used in practice (e.g.,
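One way to probe the rank invariance assumption discussed in these reviews is to train the same set of mixtures at two scales and compare the resulting rankings with a Spearman rank correlation. The sketch below is illustrative only: the loss values are made up, and the helpers `ranks` and `spearman` are hypothetical names written here with NumPy rather than taken from the paper's code.

```python
import numpy as np


def ranks(values) -> np.ndarray:
    """Rank of each entry (0 = smallest). Assumes there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result


def spearman(a, b) -> float:
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


# Made-up validation losses for the same six mixtures at two scales.
small_model_loss = np.array([3.90, 3.75, 3.82, 3.60, 3.95, 3.70])
large_model_loss = np.array([2.95, 2.80, 2.90, 2.70, 3.00, 2.78])

rho = spearman(small_model_loss, large_model_loss)
print(f"Spearman correlation between scales: {rho:.3f}")

# Same mixture ranked best at both scales?
same_best = int(np.argmin(small_model_loss)) == int(np.argmin(large_model_loss))
print("same best mixture at both scales:", same_best)
```

A correlation near 1 means the mixture ordering is preserved across scales; values well below 1 would indicate that a mixture chosen with small proxies may not transfer.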

Code & Models

Repositories

sail-sg/regmix
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training