Scalable Data Ablation Approximations for Language Models through Modular Training and Merging

Clara Na; Ian Magnusson; Ananya Harsh Jha; Tom Sherborne; Emma Strubell; Jesse Dodge; Pradeep Dasigi

arXiv:2410.15661·cs.CL·December 10, 2024

TL;DR

This paper introduces a scalable, efficient method for approximating data ablations in large language models by training on data subsets and reusing models, enabling cost-effective evaluation of data mixtures.

Contribution

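The paper proposes a modular approach to data ablations: train separate models on partitions of the training corpus, then merge them by parameter averaging to approximate a model trained on the combined data. Because the partition models are reused across candidate mixtures, the cost of a data ablation study drops substantially.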

Findings

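01. The perplexity of a single model trained on a candidate set of data correlates strongly with the perplexity of parameter averages of models trained on distinct partitions of that data.

02. The method enables inexpensive simulation of data ablations.

03. Training cost scales linearly with the amount of new data, because models trained on existing partitions are reused.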
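The linear-scaling finding can be made concrete with a rough count of training cost. The sketch below is an illustration under simplifying assumptions that are not from the paper: partitions are equal-sized, one "unit" is the cost of training on one partition, and merging and evaluation costs are ignored. The function names `naive_cost` and `modular_cost` are hypothetical.

```python
from math import comb


def naive_cost(n_partitions: int, subset_size: int) -> int:
    """Training units if every candidate mixture of `subset_size`
    partitions is trained separately (assumption: cost is proportional
    to the number of partitions in the mixture)."""
    return comb(n_partitions, subset_size) * subset_size


def modular_cost(n_partitions: int) -> int:
    """Training units if each partition is trained on once and the
    resulting models are reused for every candidate mixture."""
    return n_partitions


if __name__ == "__main__":
    for n in (4, 8, 16):
        k = n // 2
        print(f"n={n:2d} k={k}: naive={naive_cost(n, k):6d} modular={modular_cost(n)}")
```

Under these assumptions, adding one new partition adds one unit of training to the modular approach, while the number of mixtures to train in the naive approach grows combinatorially.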

Abstract

Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive since the full effect is seen only after training the models; this can lead practitioners to settle for sub-optimal data mixtures. We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus and reuses them across evaluations of combinations of subsets. In continued pre-training experiments, we find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data. From this finding, we posit that researchers and…
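A minimal sketch of the merging step described in the abstract, assuming uniform parameter averaging and a toy representation in which each model is a dictionary from parameter name to a flat list of floats. The names `average_parameters` and `perplexity`, and the toy data layout, are illustrative assumptions and are not taken from the paper or its repository.

```python
import math
from typing import Dict, List, Sequence

# Toy stand-in for a model checkpoint: parameter name -> flat weights.
Params = Dict[str, List[float]]


def average_parameters(models: Sequence[Params]) -> Params:
    """Uniformly average the parameters of models that share one architecture."""
    if not models:
        raise ValueError("need at least one model to average")
    names = models[0].keys()
    merged: Params = {}
    for name in names:
        tensors = [m[name] for m in models]  # KeyError if a model lacks `name`
        length = len(tensors[0])
        if any(len(t) != length for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [sum(vals) / len(models) for vals in zip(*tensors)]
    return merged


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood (in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # Hypothetical pool: one model per data partition.
    pool = {
        "partition_a": {"w": [1.0, 2.0]},
        "partition_b": {"w": [3.0, 4.0]},
    }
    merged = average_parameters([pool["partition_a"], pool["partition_b"]])
    print(merged)
```

In this setup, a candidate mixture is assessed by averaging the pool entries for its partitions and computing the merged model's perplexity on the evaluation set, in place of training a new model on the mixture.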

Code & Models

Repositories

clarana/ez-data-ablations (PyTorch, official)

Videos

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training