R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala

TL;DR
R&B is a framework that dynamically regroups training data by semantic similarity and rebalances the data mixture using domain gradients, improving foundation model training efficiency and performance with negligible additional compute.
Contribution
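The paper proposes a data mixing method that re-partitions training data into finer-grained domains by semantic similarity and rebalances the mixture using domain gradients collected during training, avoiding the extra evaluation passes that other mixing methods require.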
Findings
R&B matches or exceeds state-of-the-art data mixing strategies.
It is effective across diverse datasets spanning language, reasoning, and multimodal tasks.
It achieves these results with only 0.01% additional compute overhead.
Abstract
Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and…
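The two stages named in the abstract can be illustrated with a rough sketch. Everything below is an assumption made for illustration, not the authors' implementation: `regroup` uses plain k-means on precomputed embeddings as a stand-in for semantic regrouping, and `balance` turns the Gram matrix of per-domain gradients into mixture weights with a softmax over row sums, which is only one plausible rule. The abstract does not give the paper's actual update, and the names `regroup`, `balance`, and `temperature` are hypothetical.

```python
import numpy as np


def regroup(embeddings, k, iters=20, seed=0):
    """Assign each example to one of k clusters with plain k-means.

    Assumption: semantic similarity is approximated by Euclidean distance
    between precomputed embedding vectors (shape: n x d).
    """
    rng = np.random.default_rng(seed)
    # Fancy indexing returns a copy, so updating centers leaves the data intact.
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center (n x k).
        d2 = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


def balance(domain_grads, temperature=1.0):
    """Map per-domain gradients (k x d) to mixture weights on the simplex.

    Assumption: a domain whose gradient aligns with the other domains'
    gradients gets more weight. This rule is illustrative only.
    """
    gram = domain_grads @ domain_grads.T        # k x k Gram matrix
    scores = gram.sum(axis=1) / temperature     # alignment with all domains
    scores = scores - scores.max()              # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(), gram


# Toy usage: 3 domains with 2-dimensional gradients.
grads = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
weights, gram = balance(grads)
```

In this toy case the first two domains have identical gradients and the third points the opposite way, so the first two receive equal weight and the third receives less. In a real training loop the domain gradients would come from the optimization steps already being taken, which is how the abstract describes avoiding extra evaluation compute.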
