Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
Michael Y. Hu, Apurva Gandhi, Kyunghyun Cho, Tal Linzen, Pratyusha Sharma

TL;DR
This paper introduces OP-Mix, a unified online data mixing algorithm for language model training that improves performance and efficiency across pretraining and continual learning phases.
Contribution
The paper proposes a novel online data mixing method that operates throughout the entire training lifecycle, eliminating the need for separate proxy models.
Findings
OP-Mix improves pretraining perplexity by 6.3%.
OP-Mix reduces compute by 66-95% in continual learning.
OP-Mix consistently finds near-optimal data mixtures across tasks.
Abstract
Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision-making problem, one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by…
