Data Mixing for Large Language Models Pretraining: A Survey and Outlook
Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong

TL;DR
This survey comprehensively reviews data mixing techniques for large language model pretraining, formalizing the problem, categorizing methods, analyzing trade-offs, and outlining future research directions.
Contribution
The survey provides a systematic taxonomy of data mixing methods, formalizes the mixture optimization problem, and discusses challenges and future directions in LLM pretraining.
Findings
Data mixing methods are categorized into static and dynamic types.
Trade-offs exist between performance gains and cost control.
Challenges include transferability, evaluation standards, and optimization objectives.
Abstract
Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limited budgets more effectively. In recent years, a growing body of work has proposed principled data mixing methods for LLM pretraining; however, the literature remains fragmented and lacks a dedicated, systematic survey. This paper provides a comprehensive review of data mixing for LLM pretraining. We first formalize data mixture optimization as a bilevel problem on the probability simplex, clarify the role of data mixing in the pretraining pipeline, and briefly explain how existing methods make this formulation tractable in practice. We…
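As a rough illustration of what "domain-level sampling weights" on the probability simplex means, the sketch below normalizes a set of non-negative domain weights so they sum to 1 and then draws training examples' source domains according to that mixture. It is not taken from the survey: the domain names, the raw weights, and the helper names `normalize_to_simplex` and `sample_domains` are hypothetical, and the mixture is fixed by hand rather than optimized, so this shows only the sampling side of a static mixture, not the bilevel optimization itself.

```python
import random
from collections import Counter


def normalize_to_simplex(raw_weights):
    """Rescale non-negative domain weights so they sum to 1."""
    if any(w < 0 for w in raw_weights.values()):
        raise ValueError("domain weights must be non-negative")
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one domain weight must be positive")
    return {domain: w / total for domain, w in raw_weights.items()}


def sample_domains(weights, num_samples, seed=0):
    """Draw the source domain of each training example from the mixture."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    return rng.choices(domains, weights=probs, k=num_samples)


# Hypothetical domains and raw weights, chosen only for illustration.
raw = {"web": 6.0, "code": 3.0, "books": 1.0}
mixture = normalize_to_simplex(raw)

# A static mixture: the same weights are used for every draw.
draws = sample_domains(mixture, num_samples=10_000)
counts = Counter(draws)
```

A dynamic method would, under the same assumptions, recompute `mixture` during training and pass the updated weights to `sample_domains` for subsequent batches.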
