Aioli: A Unified Optimization Framework for Language Model Data Mixing
Mayee F. Chen, Michael Y. Hu, Nicholas Lourie, Kyunghyun Cho, Christopher Ré

TL;DR
Aioli introduces an online optimization method that dynamically adjusts data mixture proportions during language model training. It consistently outperforms a simple stratified sampling baseline across multiple datasets, which existing approaches fail to do.
Contribution
The paper unifies existing data mixing methods into a common framework and proposes Aioli, a novel online method that accurately estimates mixing laws to improve language model training.
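To make "estimating a mixing law" concrete, here is a minimal sketch, not the authors' algorithm. It assumes a linear mixing law in which the change in each group's loss after a short training window is approximately `-A @ p`, where `p` is the mixture used and `A[i, j]` measures how much data from group `j` reduces the loss on group `i`. The function names, the least-squares fit, and the exponentiated-gradient update are illustrative assumptions, not details taken from this summary.

```python
import numpy as np


def fit_linear_mixing_law(P, delta_loss):
    """Least-squares estimate of A under the assumed law delta_loss ≈ -P @ A.T.

    P:          (n_trials, m) mixture proportions tried; each row sums to 1.
    delta_loss: (n_trials, m) observed change in each group's loss per trial.
    Returns A with shape (m, m).
    """
    # Solve P @ X = -delta_loss for X = A.T in the least-squares sense.
    X, *_ = np.linalg.lstsq(P, -delta_loss, rcond=None)
    return X.T


def exponentiated_gradient_step(p, A, lr=0.5):
    """Shift mixture weight toward groups that reduce total loss the most."""
    scores = A.sum(axis=0)  # benefit of group j summed over all groups i
    # Subtracting the max leaves the normalized result unchanged
    # and avoids overflow in exp.
    w = p * np.exp(lr * (scores - scores.max()))
    return w / w.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical ground-truth interaction matrix for three data groups.
    A_true = np.array([[1.0, 0.2, 0.0],
                       [0.1, 0.8, 0.3],
                       [0.0, 0.4, 1.2]])
    P = rng.dirichlet(np.ones(3), size=8)   # eight trial mixtures
    delta = -P @ A_true.T                   # noiseless synthetic observations
    A_hat = fit_linear_mixing_law(P, delta)
    print(exponentiated_gradient_step(np.full(3, 1 / 3), A_hat))
```

In this toy setup the fit recovers the matrix exactly because the observations are noiseless. With real training runs the estimate would be noisy, which is where the accuracy of the parameter estimates becomes the deciding factor.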
Findings
Aioli outperforms stratified sampling on all tested datasets.
Existing methods often set mixing law parameters inaccurately.
Aioli improves performance in resource-constrained training scenarios.
Abstract
Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity. To understand this inconsistency, we unify existing methods into a standard framework, showing they are equivalent to solving a common optimization problem: minimize average loss subject to a method-specific mixing law -- an implicit assumption on the relationship between loss and mixture proportions. This framework suggests that measuring the fidelity of a method's mixing law can offer insights into its performance.…
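The optimization problem described in the abstract can be written schematically as follows. This is a sketch under assumed notation: $p$ is the vector of mixture proportions over $m$ data groups, $L_i$ is the loss on group $i$, and $c_i$, $b_i$, $A_{ij}$ are the mixing law parameters. The linear and log-linear forms are given as examples of a mixing law, following the "linear (or log-linear)" description in the reviews; the exact parameterization in the paper may differ.

```latex
% Common problem: choose proportions on the simplex to minimize average loss
\min_{p \in \Delta^{m-1}} \; \frac{1}{m} \sum_{i=1}^{m} L_i(p)
\quad \text{subject to a mixing law, e.g.}

% Linear mixing law (assumed form)
L_i(p) = c_i - b_i \sum_{j=1}^{m} A_{ij}\, p_j

% Log-linear mixing law (assumed form)
L_i(p) = c_i + b_i \exp\!\Big( -\sum_{j=1}^{m} A_{ij}\, p_j \Big)
```

Under this reading, methods differ mainly in how they set the parameters $c_i$, $b_i$, and $A_{ij}$, so a method whose estimates are far from the true loss-proportion relationship can end up with worse proportions than stratified sampling.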
Peer Reviews
Decision: ICLR 2025 Poster
1. The unified framework integrates several recent data mixing methods for large language models. This part is well written and easy to follow. 2. The problem it tackles is of great importance. Related work proposes different methods, and this paper tries to unify them in one framework, which could advance future research on this topic. 3. The experiments provided are all necessary to support the claims.
1. The dense formulas, notation, and extensive details make some of the content hard to follow. For instance, in Section 5 (Estimating $A^{t\star}$), additional clarification of the core concepts before presenting the numerous notations would enhance understanding. 2. The experimental settings are insufficient. All experiments use a 160M model with a maximum of 50K steps. In Section 4.1, when m=7, 10 different proportions are not enough to find the optimal proportion.
This article is well written and establishes a mathematical expression framework that unifies the data mixing methods mentioned in this paper. It reveals that the different data mixing methods mentioned in this paper are formally unified, with differences mainly concentrated in parameter estimation methods. At the same time, a targeted method for parameter estimation in the framework was proposed, which demonstrated effectiveness in practical experiments.
I am willing to acknowledge the contribution of the framework proposed in this paper, but the proposal of the framework focuses more on "induction" based on existing methods, which leads to a lack of persuasiveness in the innovation and practicality of the framework. Specifically, 1) the basic assumption of the framework is that the loss-proportion relationship is "linear" (or log linear). Although there has been experimental verification, the experiments are very empirical and the experimental
1. The paper is well organized and easy to follow, with extensive experimental details. 2. This paper provides a unified framework for data mixing problems that explains the differences between methods, which is a significant conceptual contribution. 3. The experimental results confirm that the method generally works well, showing improvement on 6 out of 6 datasets.
1. Although AIOLI improves on stratified sampling on all evaluated data, it can underperform baseline methods on some datasets (e.g., A/B/SE in Tab. 2). 2. The authors propose the LearnParams algorithm to estimate A* without sweeping, but its accuracy is unclear. It is also unclear whether the simulation process affects model performance. 3. AIOLI involves a number of hyperparameters, but the description of how to choose them is insufficient. This may hinder the application of AIOLI.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
