MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Zichun Yu; Spandan Das; Chenyan Xiong

arXiv:2406.06046 · cs.CL · November 19, 2024 · 2 cites

TL;DR

MATES introduces a dynamic, model-aware data selection method for language model pretraining that adapts to the model's evolving needs, significantly improving efficiency and downstream performance.

Contribution

MATES proposes a data influence model that continuously adapts during pretraining, outperforming static selection methods and reducing computational cost.

Findings

01 · Outperforms state-of-the-art data selection methods.

02 · Doubles gains on downstream tasks compared to previous approaches.

03 · Halves the FLOPs needed to reach target performance.

Abstract

Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we collect oracle data influence by locally probing the pretraining model and fine-tune a small data influence model to approximate it accurately. The data influence model then predicts data influence over the whole pretraining corpus and selects the most…
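The loop described in the abstract (probe the current model for oracle influence, fit a small influence model to those labels, then score and select from the full pool) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the "pretraining model" here is a linear regressor, "locally probing" is read as measuring how much one gradient step on a single example lowers a reference loss, the influence model is a ridge regression rather than a neural network, and the sign convention (larger means more helpful) plus all function names are hypothetical.

```python
import numpy as np


def oracle_influence(weights, x, y, ref_x, ref_y, lr=0.1):
    """Drop in reference loss after one gradient step on a single example.

    Assumption: positive values mean the example helped the reference task.
    """
    def ref_loss(w):
        return float(np.mean((ref_x @ w - ref_y) ** 2))

    # Gradient of the squared error (x.w - y)^2 with respect to w.
    grad = 2.0 * (x @ weights - y) * x
    return ref_loss(weights) - ref_loss(weights - lr * grad)


def fit_influence_model(features, influences, ridge=1e-3):
    """Ridge regression standing in for the small data influence model."""
    dim = features.shape[1]
    gram = features.T @ features + ridge * np.eye(dim)
    return np.linalg.solve(gram, features.T @ influences)


def select_top_k(features, influence_weights, k):
    """Indices of the k examples with the highest predicted influence."""
    scores = features @ influence_weights
    return np.argsort(scores)[::-1][:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -1.0])
    pool_x = rng.normal(size=(200, 2))
    pool_y = pool_x @ true_w
    ref_x = rng.normal(size=(50, 2))
    ref_y = ref_x @ true_w

    model_w = np.zeros(2)  # current state of the toy "pretraining model"

    # Probe only a small subset, since oracle labels are expensive to collect.
    probe = rng.choice(len(pool_x), size=40, replace=False)
    labels = np.array([
        oracle_influence(model_w, pool_x[i], pool_y[i], ref_x, ref_y)
        for i in probe
    ])

    # Hypothetical featurisation: raw inputs concatenated with the target.
    feats = np.column_stack([pool_x, pool_y])
    infl_w = fit_influence_model(feats[probe], labels)
    print(select_top_k(feats, infl_w, k=20))
```

In a model-aware setup, these three steps would be repeated at intervals as the model's weights change, so that the selected data tracks the model's current state rather than a fixed, one-time ranking.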

Code & Models

Repositories

cxcscmu/mates · PyTorch · Official

Videos

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models · SlidesLive

Taxonomy

Topics: Time Series Analysis and Forecasting · Data Mining Algorithms and Applications · Advanced Database Systems and Queries

Methods: Pythia