Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning
Ziqing Fan, Yuqiao Xian, Yan Sun, Li Shen

TL;DR
This paper introduces DATAMASK, a policy gradient-based framework that jointly selects large-scale pre-training data on quality and diversity metrics, significantly reducing selection time and improving model performance.
Contribution
It proposes a novel mask learning approach for joint data selection, enabling efficient optimization of multiple metrics on trillion-token-scale datasets.
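To make the idea of policy gradient-based mask learning concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation. The reward function (mean quality plus `lam` times one minus the mean pairwise cosine similarity), the zero initialization of logits, the learning rate, and the step count are all assumptions chosen for illustration. Only the general ingredients (a learnable mask, a REINFORCE-style update, a balancing weight λ, and a group of G sampled masks) come from the summary and reviews on this page.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def joint_reward(mask, quality, emb, lam):
    """Hypothetical joint reward: mean quality of the selected samples
    plus lam * (1 - mean pairwise cosine similarity) as a diversity term.
    Assumes the rows of emb are L2-normalized."""
    idx = np.flatnonzero(mask)
    n = idx.size
    if n < 2:
        return 0.0  # too few samples to score diversity
    sim = emb[idx] @ emb[idx].T
    mean_off_diag = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return quality[idx].mean() + lam * (1.0 - mean_off_diag)


def learn_mask(quality, emb, lam=1.0, group_size=8, lr=1.0, steps=300, seed=0):
    """REINFORCE-style update of per-sample keep probabilities."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(quality))  # assumed initialization: p = 0.5
    for _ in range(steps):
        p = sigmoid(logits)
        # Sample a group of G Bernoulli masks from the current policy.
        masks = rng.random((group_size, p.size)) < p
        rewards = np.array([joint_reward(m, quality, emb, lam) for m in masks])
        # Group-mean baseline reduces the variance of the estimate.
        advantages = rewards - rewards.mean()
        # For a Bernoulli policy, d log pi / d logit = mask - p.
        grad = (advantages[:, None] * (masks - p)).mean(axis=0)
        logits += lr * grad  # gradient ascent on expected reward
    return sigmoid(logits)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    quality = np.array([1.0] * 10 + [0.0] * 10)  # toy quality scores
    emb = rng.normal(size=(20, 4))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    keep_prob = learn_mask(quality, emb, lam=0.5)
    print(np.round(keep_prob, 2))
```

In this toy setup, the keep probabilities of the high-quality samples should rise over training while the diversity term discourages keeping near-duplicate embeddings. A real pipeline at trillion-token scale would need a selection budget, batching, and the acceleration techniques the paper describes, none of which are modeled here.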
Findings
Reduces selection time by 98.9% compared to greedy algorithms.
Selects about 10% of a 15-trillion-token corpus; the selected subset is called FineWeb-Mask.
Achieves up to 3.2% performance improvement on large models.
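As a quick sanity check on what these figures imply, the arithmetic can be worked out directly. The variable names below are illustrative, and the calculation assumes "98.9%" refers to wall-clock selection time relative to the greedy baseline.

```python
# Figures quoted in the findings above.
total_tokens = 15 * 10**12   # 15 trillion source tokens
selected_share_pct = 10      # "about 10%" is selected
time_reduction_pct = 98.9    # selection time reduction vs. greedy

# Size of the selected subset, in tokens.
selected_tokens = total_tokens * selected_share_pct // 100

# A 98.9% reduction leaves 1.1% of the original time.
remaining_time_fraction = 1 - time_reduction_pct / 100
speedup = 1 / remaining_time_fraction

print(f"selected tokens: {selected_tokens:,}")
print(f"speedup over greedy: {speedup:.1f}x")
```

So the subset is roughly 1.5 trillion tokens, and the reported time reduction corresponds to a speedup of about 91x over the greedy baseline.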
Abstract
A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce…
Peer Reviews
Decision: ICLR 2026 Poster
Originality: The paper addresses a critical problem, namely the trade-off between quality and diversity in LLM pre-training data selection. The idea of treating data selection as a mask learning problem and using policy gradients for optimization is novel. It moves beyond traditional greedy selection strategies, offering a unified learning-based approach. Quality: The paper is supported by rigorous empirical validation, including large-scale experiments on trillion-token datasets. Ablation studi
1. Partial Ablation of Core Parameters: While the paper ablates diversity metrics, the balancing hyperparameter (λ), and the group size (G) in policy gradient estimation, it lacks systematic exploration of other key hyperparameters, such as the learning rate, the number of update epochs, and the initialization strategies for logits. This omission limits the understanding of the method's robustness and sensitivity to its full configuration.
2. Insufficient Accessibility and Clarity in Figures: Figu
- There is an inherent trade-off between generality and specificity that has not been considered in existing related work. - I appreciate the formalized approach that provides users with a more principled way of data curation. I believe such techniques are particularly valuable in increasing the sample efficiency during pre-training and ultimately driving down cost. - The transparent cost breakdown helps others estimate whether DATAMASK is a useful (and affordable) technique for their individual
- For pre-training, the proposed FineWeb-Mask dataset is rather small for fully training 7/8B-parameter (dense) or even larger models. SOTA 8B dense models are typically trained on 10T+ tokens. I could see the dataset being applicable to what is sometimes referred to as stage-two pre-training, i.e., showing documents to a model that contain desirable information for later post-training steps that require versatility. Exploring how well the specificity-/generality-balance introduced t
Novelty of the problem definition: The paper conceptualizes the large-scale data selection problem as a learnable mask optimization task and uses policy gradient-based optimization and various acceleration enhancements to optimize the selection speed. Strong motivation and empirical analysis: The paper demonstrates the fundamental limitations of single-metric selection on large-scale pre-training datasets, and uses visualizations to express the conflict of data quality and diversity that su
Limited methodological originality: The novelty is incremental, not a fundamentally new method. The combination of mask learning and policy gradient is a direct application of REINFORCE-style policy gradient to a combinatorial subset selection problem. Similar implementations have been used in: RL-based data pruning or sample selection (e.g., RLDataSampler, ICML 2022); differentiable subset selection in vision and NLP (e.g., DPPNet, CVPR 2021; SubsetFormer, NeurIPS 2023). Lack of fai
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
