OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Shaobo Wang; Xuan Ouyang; Tianyi Xu; Yuzheng Hu; Jialin Liu; Guo Chen; Tianyu Zhang; Junhao Zheng; Kexin Yang; Xingzhang Ren; Dayiheng Liu; Linfeng Zhang

arXiv:2602.05400·cs.CL·February 10, 2026

Open Access

TL;DR

OPUS introduces a dynamic data selection method for large language model pre-training that leverages optimizer-induced updates, improving efficiency and performance across various settings with minimal additional computational cost.

Contribution

OPUS is a novel, optimizer-aware data selection framework that improves pre-training efficiency by projecting candidate updates onto a stable target direction, outperforming both static filters and full-data training.
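The idea of scoring a candidate by projecting its optimizer-shaped update onto a target direction can be illustrated with a small sketch. This is not the authors' implementation: the Adam-style rescaling, the choice of the negative proxy gradient as the target direction, and all names and numbers below are assumptions made for illustration only.

```python
import math


def adam_style_update(grad, second_moment, lr=1.0, eps=1e-8):
    """Assumed effective update: the gradient rescaled per coordinate,
    as an Adam-like optimizer would do (momentum omitted for brevity)."""
    return [-lr * g / (math.sqrt(v) + eps) for g, v in zip(grad, second_moment)]


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def projected_utility(grad, second_moment, target_direction):
    """Hypothetical utility: inner product of the candidate's effective
    update with the target direction."""
    return dot(adam_style_update(grad, second_moment), target_direction)


# Toy numbers, invented for this example.
second_moment = [1.0, 100.0]         # running estimate of squared gradients
proxy_grad = [1.0, 1.0]              # gradient on a held-out proxy batch
target = [-g for g in proxy_grad]    # assumed target: descent direction on the proxy

cand_a = [1.0, 0.0]                  # candidate gradient along a low-variance coordinate
cand_b = [0.0, 5.0]                  # larger gradient along a high-variance coordinate

raw_a = dot(cand_a, proxy_grad)      # optimizer-agnostic alignment
raw_b = dot(cand_b, proxy_grad)
util_a = projected_utility(cand_a, second_moment, target)
util_b = projected_utility(cand_b, second_moment, target)
```

In this toy setting, raw gradient alignment ranks `cand_b` above `cand_a`, while the update-space score ranks `cand_a` first, because the optimizer damps the second coordinate. That reversal is the kind of discrepancy an optimizer-aware criterion is meant to capture.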

Findings

01. Outperforms baselines in GPT-2 pre-training on diverse corpora.

02. Achieves comparable or better results than full 200B-token training with fewer tokens.

03. Improves data efficiency in domain-specific continued pre-training.

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers,…
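The abstract names Boltzmann sampling as the mechanism that preserves data diversity. A minimal sketch of that general technique follows, assuming each candidate already has a scalar utility score. The function names, the temperature parameter, and the without-replacement scheme are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def boltzmann_probabilities(scores, temperature=1.0):
    """Softmax over scores at the given temperature (must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_select(scores, k, temperature=1.0, seed=0):
    """Draw up to k distinct candidate indices, one at a time, with
    probability proportional to exp(score / temperature)."""
    rng = random.Random(seed)
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        probs = boltzmann_probabilities(
            [scores[i] for i in remaining], temperature
        )
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        chosen.append(remaining.pop(pick))
    return chosen


# Hypothetical utility scores for five candidates.
scores = [2.0, 1.0, 0.5, 0.0, -1.0]
probs = boltzmann_probabilities(scores, temperature=0.5)
selected = boltzmann_select(scores, k=3, temperature=0.5, seed=0)
```

A low temperature concentrates probability on the highest-scoring candidates and approaches top-k selection. A high temperature approaches uniform sampling. Unlike a hard top-k cut, every candidate keeps a nonzero chance of being selected, which is how this kind of sampling trades some utility for diversity.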

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification