MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Zichun Yu, Spandan Das, Chenyan Xiong

TL;DR
MATES introduces a dynamic, model-aware data selection method for language model pretraining that adapts to the model's evolving needs, significantly improving efficiency and downstream performance.
Contribution
The paper proposes a data influence model that continuously adapts to the pretraining model as training progresses, outperforming static selection methods while reducing computational cost.
Findings
Outperforms state-of-the-art data selection methods.
Doubles the downstream-task gains achieved by previous data selection approaches.
Halves the FLOPs needed to reach target performance.
Abstract
Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we collect oracle data influence by locally probing the pretraining model and fine-tune a small data influence model to approximate it accurately. The data influence model then predicts data influence over the whole pretraining corpus and selects the most…
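The loop described in the abstract (probe the current model for oracle influence, fit a small influence model to it, score the whole corpus, select, train, repeat) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: a linear regression model stands in for the pretraining model, ridge regression over quadratic features stands in for the small data influence model, influence is taken to be the drop in loss on a clean reference set after one gradient step, and all function names are hypothetical.

```python
import numpy as np

def loss(w, X, y):
    """Mean squared error of a linear model with weights w."""
    return float(np.mean((X @ w - y) ** 2))

def one_step(w, x, y, lr=0.02):
    """One SGD step on a single example (x, y)."""
    grad = 2.0 * (x @ w - y) * x
    return w - lr * grad

def oracle_influence(w, x, y, X_ref, y_ref, lr=0.02):
    """Assumed definition: drop in reference loss after one step on (x, y)."""
    return loss(w, X_ref, y_ref) - loss(one_step(w, x, y, lr), X_ref, y_ref)

def featurize(X, y):
    """Quadratic features of each (x, y) pair; a toy stand-in for a text encoder."""
    Z = np.hstack([X, y[:, None], np.ones((len(y), 1))])
    return np.einsum("ni,nj->nij", Z, Z).reshape(len(y), -1)

def fit_influence_model(feats, targets, ridge=1e-3):
    """Ridge regression standing in for the small data influence model."""
    A = feats.T @ feats + ridge * np.eye(feats.shape[1])
    return np.linalg.solve(A, feats.T @ targets)

def select_top_k(scores, k):
    """Indices of the k highest predicted influence scores, best first."""
    return np.argsort(scores)[::-1][:k]

def run_demo(seed=0, stages=3, probe=200, k=50, lr=0.02):
    """Alternate between probing, fitting, selecting, and training."""
    rng = np.random.default_rng(seed)
    d, n = 4, 2000
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_true
    noisy = rng.random(n) < 0.3                      # corrupt 30% of the labels
    y = y + noisy * rng.normal(scale=5.0, size=n)
    X_ref = rng.normal(size=(200, d))
    y_ref = X_ref @ w_true                           # clean reference set
    feats = featurize(X, y)

    w = np.zeros(d)
    history = [loss(w, X_ref, y_ref)]
    for _ in range(stages):
        # 1. Probe the current model on a small random subset.
        idx = rng.choice(n, size=probe, replace=False)
        targets = np.array(
            [oracle_influence(w, X[i], y[i], X_ref, y_ref, lr) for i in idx]
        )
        # 2. Re-fit the influence model so it tracks the current model state.
        theta = fit_influence_model(feats[idx], targets)
        # 3. Score the whole corpus and keep the top-k examples.
        chosen = select_top_k(feats @ theta, k)
        # 4. Train on the selected examples.
        for i in chosen:
            w = one_step(w, X[i], y[i], lr)
        history.append(loss(w, X_ref, y_ref))
    return history

if __name__ == "__main__":
    print(run_demo())
```

The influence model never sees the weights `w` directly; it is re-fit at every stage against fresh probes, which is how this toy mirrors the model-aware, non-static selection the abstract describes. Only the probe subset needs the expensive oracle computation, while the cheap fitted model scores the full corpus.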
Taxonomy
Topics: Time Series Analysis and Forecasting · Data Mining Algorithms and Applications · Advanced Database Systems and Queries
Methods: Pythia
