MixMin: Finding Data Mixtures via Convex Minimization
Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison

TL;DR
MixMin introduces a convex optimization approach to find optimal data mixtures for machine learning, leading to consistent improvements across language and chemistry tasks with minimal additional compute.
Contribution
The paper formalizes data mixing as a bi-level optimization problem, observes that this objective becomes convex as the model class grows larger, and develops MixMin, a gradient-based method for optimizing the resulting convex objective.
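The bi-level structure can be written out as a rough sketch. The notation here is assumed for illustration and is not taken from the paper: \lambda is a vector of mixture weights on the probability simplex over P sources, \mathcal{L}_{p} is the training loss on source p, \mathcal{L}_{\mathrm{target}} is the downstream loss, and \mathcal{F} is the model class.

```latex
% Outer problem: choose mixture weights that minimize the downstream loss.
% Inner problem: train the model on the data mixture defined by those weights.
\lambda^{*} \in \arg\min_{\lambda \in \Delta^{P}} \; \mathcal{L}_{\mathrm{target}}\!\left(f_{\lambda}\right)
\quad \text{where} \quad
f_{\lambda} \in \arg\min_{f \in \mathcal{F}} \; \sum_{p=1}^{P} \lambda_{p}\, \mathcal{L}_{p}(f)
```

Evaluating the outer objective at a single \lambda requires solving the inner training problem, which is why a direct search over mixtures is expensive.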
Findings
MixMin improved language modeling performance by 1-5% in negative log likelihood.
MixMin improved bioassay prediction, raising average precision by 0.03-0.15.
MixMin was the only method to consistently improve data mixtures across all experiments.
Abstract
Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pre-training large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional…
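To make the idea of a convex mixture objective concrete, here is a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not spell out: that one cheap proxy model per source is available, that the model trained on a mixture can be approximated by the same mixture of the proxy models' predictions, and that the weights are fit with exponentiated-gradient steps on the simplex. The names `mixture_nll` and `fit_mixture_weights` and the toy numbers are hypothetical.

```python
import math


def mixture_nll(weights, probs):
    """Average negative log likelihood of the weighted mixture of proxy predictions.

    probs[n][i] is the probability that (hypothetical) proxy model i assigns
    to the correct label of target example n.
    """
    total = 0.0
    for row in probs:
        mixed = sum(w * p for w, p in zip(weights, row))
        total -= math.log(mixed)
    return total / len(probs)


def fit_mixture_weights(probs, steps=500, lr=0.5):
    """Minimize mixture_nll over the simplex with exponentiated-gradient steps.

    The optimizer choice is an assumption made for this sketch.
    """
    k = len(probs[0])
    n = len(probs)
    weights = [1.0 / k] * k  # start from the uniform mixture
    for _ in range(steps):
        # Gradient of the average NLL with respect to each weight.
        grad = [0.0] * k
        for row in probs:
            mixed = sum(w * p for w, p in zip(weights, row))
            for i in range(k):
                grad[i] -= row[i] / (mixed * n)
        # The multiplicative update keeps weights positive;
        # renormalizing keeps them summing to one.
        weights = [w * math.exp(-lr * g) for w, g in zip(weights, grad)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return weights


# Toy data: the proxy for source 0 predicts the target well, source 1 does not.
probs = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.3], [0.85, 0.25]]
uniform = [0.5, 0.5]
weights = fit_mixture_weights(probs)
```

Because the mixed prediction is linear in the weights and the negative logarithm is convex, this toy objective is convex in the weights, so simple first-order updates suffice. On the toy data the fitted weights concentrate on source 0 and the mixture loss falls below that of the uniform mixture.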
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models
