Rethinking Data Mixing from the Perspective of Large Language Models
Yuanjian Xu, Tianze Sun, Changwei Xu, XinLong Zhao, Jianing Hao, Ran Chen, Yang Liu, Ruijie Xu, Stephen Chen, Guang Zhang

TL;DR
This paper provides a theoretical framework for understanding data mixing in large language model training, introduces a new reweighting method called DoGraph, and demonstrates its effectiveness through extensive experiments.
Contribution
It establishes a formal connection between gradient dynamics and domain distributions, and proposes DoGraph, a novel data reweighting framework for improved LLM training.
Findings
DoGraph consistently achieves competitive performance across GPT-2 models of varying scales.
Theoretical analysis clarifies the role of domain weighting in training dynamics.
Empirical results show the effectiveness of the proposed reweighting framework.
Abstract
The data mixing strategy is essential for large language model (LLM) training: empirical evidence shows that an inappropriate strategy can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.
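For readers new to the term, the sketch below illustrates what "domain weighting" means in generic data mixing: training examples are drawn from each domain in proportion to a weight, and the overall objective is the weight-averaged per-domain loss. This is not DoGraph and does not reproduce the paper's graph-constrained formulation, which the abstract does not spell out. The domain names, weights, loss values, and function names are all hypothetical and chosen only for illustration.

```python
import random

def normalize_weights(raw):
    """Scale non-negative domain weights so they sum to 1."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("domain weights must sum to a positive value")
    return {name: w / total for name, w in raw.items()}

def sample_domains(weights, n, seed=0):
    """Pick the source domain for each of n training examples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = list(weights)
    probs = [weights[d] for d in names]
    return rng.choices(names, weights=probs, k=n)

def mixture_loss(weights, domain_losses):
    """Weighted average of per-domain losses under the mixture."""
    return sum(weights[d] * domain_losses[d] for d in weights)

# Hypothetical domains and values, for illustration only.
weights = normalize_weights({"web": 6.0, "code": 3.0, "books": 1.0})
batch = sample_domains(weights, n=1000)
loss = mixture_loss(weights, {"web": 2.0, "code": 3.0, "books": 4.0})
```

A reweighting method in this setting changes `weights` over the course of training; how DoGraph chooses them is described in the paper itself.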
