OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

TL;DR
OPUS is a dynamic data selection method for large language model pre-training that scores candidate data by the update the optimizer would actually apply, improving efficiency and performance across a range of settings at a reported 4.7% additional compute overhead.
Contribution
OPUS is an optimizer-aware data selection framework that improves pre-training efficiency by projecting each candidate's optimizer-induced update onto a target direction derived from a stable, in-distribution proxy. It is reported to outperform static filters and to match or exceed training on the full data.
Findings
Outperforms baselines in GPT-2 pre-training on diverse corpora.
Achieves comparable or better results than full 200B-token training with fewer tokens.
Improves data efficiency in domain-specific continued pre-training.
Abstract
As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers,…
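The selection step the abstract describes can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes an Adam-style diagonal preconditioner as the "optimizer-induced" update, takes the raw proxy gradient as the target direction, uses a plain CountSketch without the Ghost technique, and all function and parameter names (`countsketch`, `optimizer_update`, `utility_scores`, `boltzmann_select`, `temperature`) are hypothetical.

```python
import numpy as np

def countsketch(vecs, sketch_dim, seed=0):
    """Compress rows of `vecs` (n, d) to (n, sketch_dim) with a CountSketch.

    Each input coordinate is hashed to one bucket and multiplied by a
    random sign; coordinates sharing a bucket are summed.
    """
    rng = np.random.default_rng(seed)
    d = vecs.shape[1]
    buckets = rng.integers(0, sketch_dim, size=d)  # bucket index per coordinate
    signs = rng.choice([-1.0, 1.0], size=d)        # random sign per coordinate
    out_t = np.zeros((sketch_dim, vecs.shape[0]))
    # np.add.at accumulates correctly when several coordinates share a bucket.
    np.add.at(out_t, buckets, (vecs * signs).T)
    return out_t.T

def optimizer_update(grads, second_moment, eps=1e-8):
    """Assumed Adam-style effective update: gradient scaled per coordinate."""
    return grads / (np.sqrt(second_moment) + eps)

def utility_scores(cand_grads, target_grad, second_moment, sketch_dim=64, seed=0):
    """Score each candidate by projecting its update onto the target direction.

    Candidates and target share one seed so both use the same sketch,
    which keeps inner products approximately preserved.
    """
    updates = optimizer_update(cand_grads, second_moment)
    cand = countsketch(updates, sketch_dim, seed)
    tgt = countsketch(target_grad[None, :], sketch_dim, seed)[0]
    return cand @ tgt

def boltzmann_select(scores, k, temperature=1.0, rng=None):
    """Sample k distinct candidates with probability proportional to exp(score / T)."""
    rng = np.random.default_rng() if rng is None else rng
    logits = (scores - scores.max()) / temperature  # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    chosen = rng.choice(len(scores), size=k, replace=False, p=probs)
    return chosen, probs
```

In this sketch, a lower `temperature` concentrates sampling on the highest-scoring candidates, while a higher one spreads probability more evenly, which is one way to read the abstract's point about data diversity.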
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
