Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Michael Y. Hu; Apurva Gandhi; Kyunghyun Cho; Tal Linzen; Pratyusha Sharma

arXiv:2605.15220 · cs.CL · May 18, 2026

TL;DR

This paper introduces OP-Mix, a unified online data mixing algorithm for language model training that improves performance and efficiency across pretraining and continual learning phases.

Contribution

The paper proposes an online data mixing method that operates throughout the entire training lifecycle, eliminating the need for separate proxy models.

Findings

01. OP-Mix improves pretraining perplexity by 6.3%.

02. OP-Mix reduces compute by 66-95% in continual learning.

03. OP-Mix consistently finds near-optimal data mixtures across tasks.

Abstract

Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision-making problem -- one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by…
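To make the framing in the abstract concrete, here is a minimal Python sketch of what a data mixture is and what revisiting it online could look like. It is not OP-Mix: the abstract is truncated before the method is described, so the update rule below is a generic multiplicative-weights step swapped in purely for illustration. The domain names, the `rewards` signal, and the functions `sample_batch` and `update_weights` are all hypothetical.

```python
import math
import random


def sample_batch(weights, batch_size, rng):
    """Pick a source domain for each example, proportional to the mixture weights."""
    domains = list(weights)
    probs = [weights[d] for d in domains]
    return rng.choices(domains, weights=probs, k=batch_size)


def update_weights(weights, rewards, step_size=1.0):
    """Generic multiplicative-weights update (an assumption, NOT the paper's rule).

    Domains with a higher reward are upweighted, then the mixture is
    renormalized so the weights sum to 1.
    """
    scaled = {d: w * math.exp(step_size * rewards[d]) for d, w in weights.items()}
    total = sum(scaled.values())
    return {d: v / total for d, v in scaled.items()}


rng = random.Random(0)

# Hypothetical three-domain mixture; weights sum to 1.
weights = {"web": 0.5, "code": 0.3, "math": 0.2}
batch = sample_batch(weights, batch_size=8, rng=rng)

# Hypothetical per-domain signal, e.g. an estimate of how useful each domain was.
new_weights = update_weights(weights, {"web": 0.0, "code": 0.5, "math": 0.0})
```

In an online setting, the sample-then-update pair would repeat throughout training, so the mixture is re-decided as the model changes rather than fixed in advance.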
