Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi

TL;DR
This paper introduces a scalable, efficient method for approximating data ablations in large language models by training on data subsets and reusing models, enabling cost-effective evaluation of data mixtures.
Contribution
It proposes a novel modular training and merging approach that significantly reduces the cost of data ablation studies for large language models.
Findings
The perplexity of a single model trained on a candidate dataset correlates strongly with the perplexity of parameter averages of models trained on distinct partitions of that data.
The method enables inexpensive simulation of data ablations.
Because trained models are reused across evaluations, training cost grows only linearly with the amount of new data.
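The cost argument behind these findings can be illustrated with a rough count. The sketch below is an assumption-laden toy, not the paper's accounting: it assumes equal-sized partitions, a training cost proportional to the number of partitions a model sees, and a candidate space made of every non-empty combination of partitions up to a given size. It ignores the cost of merging and evaluation, which this summary does not quantify. The function names `naive_cost` and `modular_cost` are hypothetical.

```python
from itertools import combinations


def naive_cost(n_partitions: int, max_size: int, unit_cost: float = 1.0) -> float:
    """Cost of training one separate model per candidate mixture.

    Assumption: training on k partitions costs k * unit_cost.
    """
    total = 0.0
    for k in range(1, max_size + 1):
        for combo in combinations(range(n_partitions), k):
            total += unit_cost * len(combo)
    return total


def modular_cost(n_partitions: int, unit_cost: float = 1.0) -> float:
    """Cost of training one model per partition and reusing it in every mixture."""
    return unit_cost * n_partitions


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, naive_cost(n, n), modular_cost(n))
```

Under these assumptions, adding one partition adds one unit of training to the modular pool, while the naive count grows with the number of mixtures that contain the new partition.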
Abstract
Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive since the full effect is seen only after training the models; this can lead practitioners to settle for sub-optimal data mixtures. We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus and reuses them across evaluations of combinations of subsets. In continued pre-training experiments, we find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data. From this finding, we posit that researchers and…
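As a minimal sketch of the idea described in the abstract, the code below averages the parameters of models trained on separate partitions and enumerates merged candidates from a shared pool. Several things here are assumptions for illustration only: parameters are stored as dictionaries of NumPy arrays, the average is a uniform elementwise mean (the abstract says only "parameter averages"), and the pool contents, partition names, and helper names (`average_parameters`, `merged_candidates`, `perplexity`) are hypothetical. The `perplexity` helper uses the standard definition, the exponential of the mean negative log-likelihood per token.

```python
from itertools import combinations

import numpy as np


def average_parameters(state_dicts):
    """Uniform elementwise mean of parameter dicts with identical keys and shapes."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}


def merged_candidates(pool, size):
    """Yield (partition names, averaged parameters) for each combination of `size` models."""
    for names in combinations(sorted(pool), size):
        yield names, average_parameters([pool[n] for n in names])


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    return float(np.exp(-np.mean(token_log_probs)))


# Hypothetical pool: one tiny "model" per data partition (random stand-ins, not trained).
rng = np.random.default_rng(0)
pool = {
    name: {"w": rng.normal(size=(2, 3)), "b": rng.normal(size=3)}
    for name in ["part_a", "part_b", "part_c"]
}

# Each pooled model is trained once and reused in every candidate mixture.
candidates = dict(merged_candidates(pool, 2))
```

In a real setting, each averaged parameter set would be loaded into the model architecture and scored on the evaluation set with `perplexity`; that step is omitted here because it depends on the model and data in use.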
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
