Olmix: A Framework for Data Mixing Throughout LM Development
Mayee F. Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi, Christopher Ré, Luca Soldaini, Kyle Lo

TL;DR
Olmix is a framework for data mixing in language model development. It studies the design choices behind mixing methods and handles domain sets that evolve during development, so mixtures can be recomputed efficiently while improving downstream performance.
Contribution
The paper provides a systematic empirical study of data mixing design choices and proposes mixture reuse to efficiently update data mixes during LM development.
Findings
Mixture reuse reduces recomputation by 74%.
Mixing improves downstream task performance by 11.6%.
Identifies key design choices for effective data mixing.
Abstract
Data mixing -- determining the ratios of data from different domains -- is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood -- design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, in practice, the domain set evolves throughout LM development as datasets are added, removed, partitioned, and revised -- a problem setting largely unaddressed by existing works, which assume fixed domains. We study how to efficiently recompute the…
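As a minimal illustration of what "determining the ratios of data from different domains" means in practice, the sketch below normalizes a set of domain weights and turns them into per-domain token budgets. It is not Olmix's method: the function name `token_budgets`, the domain names, and the weights are all hypothetical, and the sketch ignores practical issues such as limited data per domain.

```python
def token_budgets(mix, total_tokens):
    """Turn unnormalized domain weights into per-domain token budgets."""
    if not mix:
        raise ValueError("mix must contain at least one domain")
    if any(w < 0 for w in mix.values()):
        raise ValueError("weights must be non-negative")
    total_weight = sum(mix.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    # Normalize so the ratios sum to 1, then scale to the training budget.
    ratios = {d: w / total_weight for d, w in mix.items()}
    return {d: round(r * total_tokens) for d, r in ratios.items()}


# Hypothetical domains and weights, chosen only for illustration.
example_mix = {"web": 6.0, "code": 3.0, "math": 1.0}
budgets = token_budgets(example_mix, total_tokens=1_000_000)
print(budgets)
```

A mixing method, in this framing, is whatever procedure chooses the weights passed in as `mix`; the sketch only shows how a chosen mix translates into how much data each domain contributes.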
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
