Rethinking Data Mixing from the Perspective of Large Language Models

Yuanjian Xu; Tianze Sun; Changwei Xu; XinLong Zhao; Jianing Hao; Ran Chen; Yang Liu; Ruijie Xu; Stephen Chen; Guang Zhang

arXiv:2604.07963·cs.CL·April 10, 2026

TL;DR

This paper develops a theoretical framework that links gradient dynamics to domain distributions in large language model training, introduces DoGraph, a reweighting framework that casts data scheduling as a graph-constrained optimization problem, and reports competitive performance on GPT-2 models of varying scales.

Contribution

The paper establishes a formal connection between gradient dynamics and domain distributions, and proposes DoGraph, a data reweighting framework that formulates data scheduling as a graph-constrained optimization problem.

Findings

01

DoGraph consistently achieves competitive performance on GPT-2 models of varying scales.

02

Theoretical analysis clarifies the role of domain weighting in training dynamics.

03

Empirical results show the effectiveness of the proposed reweighting framework.

Abstract

Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.
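
The abstract does not spell out DoGraph's objective or update rule, so the sketch below is not the authors' method. It is only a generic illustration of what "reweighting under a graph constraint" can look like: domain weights live on the probability simplex, each domain has a usefulness score, and a graph penalty discourages linked domains from receiving very different weights. The function name `graph_smoothed_reweight`, the toy domains, the scores, the adjacency matrix, and the hyperparameters `lam` and `eta` are all assumptions made for illustration.

```python
import math

def graph_smoothed_reweight(weights, scores, adjacency, lam=0.5, eta=0.1):
    """One exponentiated-gradient step on a hypothetical objective:

        maximize  sum_k w_k * s_k  -  (lam / 2) * sum_{i,j} A_ij * (w_i - w_j)^2

    with w kept on the probability simplex. A is assumed symmetric.
    This is an illustrative stand-in, not the DoGraph formulation.
    """
    n = len(weights)
    grads = []
    for k in range(n):
        # Gradient of the graph penalty w.r.t. w_k is 2 * lam * sum_j A_kj (w_k - w_j).
        smooth = sum(adjacency[k][j] * (weights[k] - weights[j]) for j in range(n))
        grads.append(scores[k] - 2.0 * lam * smooth)
    # Multiplicative update keeps every weight positive; renormalizing keeps the sum at 1.
    unnorm = [w * math.exp(eta * g) for w, g in zip(weights, grads)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Toy setup (all values hypothetical): "web" and "code" are linked, "books" is isolated.
domains = ["web", "code", "books"]
adjacency = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
scores = [0.2, 0.8, 0.1]          # e.g. some per-domain usefulness signal
weights = [1.0 / 3.0] * 3         # start from a uniform mixture

for _ in range(50):
    weights = graph_smoothed_reweight(weights, scores, adjacency)

for name, w in zip(domains, weights):
    print(f"{name}: {w:.3f}")
```

In this toy run the highest-scoring domain gains weight, while the graph penalty pulls the weight of its linked neighbor toward it. Where the scores and the graph would come from in practice is exactly what the paper's analysis of gradient dynamics is meant to address, and is not modeled here.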
