Data, Data Everywhere: A Guide for Pretraining Dataset Construction

Jupinder Parmar; Shrimai Prabhumoye; Joseph Jennings; Bo Liu; Aastha Jhunjhunwala; Zhilin Wang; Mostofa Patwary; Mohammad Shoeybi; Bryan Catanzaro

arXiv:2407.06380·cs.CL·October 22, 2024

TL;DR

This paper systematically analyzes the construction of large pretraining datasets for language models, identifying effective methods and proposing improvements based on data attributes to enhance model performance.

Contribution

It provides the first comprehensive study of pretraining dataset construction, including ablations, data source categorization, and attribute-based refinement strategies.

Findings

01 Certain data collection techniques significantly improve downstream accuracy.

02 Web crawl data varies widely in toxicity, quality, and domain, affecting model training.

03 Attribute-based data filtering enhances pretraining set quality.

Abstract

The impressive capabilities of recent language models can be largely attributed to the multi-trillion-token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These…
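As a rough illustration of the attribute-based refinement the abstract describes, the sketch below filters documents using per-document attribute labels. It is not the authors' pipeline: the `Doc` fields, the `refine` function, the toxicity threshold, and the quality labels are all hypothetical, and it assumes that some upstream classifier has already attached the attributes.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    toxicity: float  # hypothetical score in [0, 1]; higher means more toxic
    quality: str     # hypothetical label: "low", "medium", or "high"
    domain: str      # hypothetical domain label, e.g. "news"


def refine(docs, max_toxicity=0.5, allowed_quality=("medium", "high")):
    """Keep documents that pass both attribute checks (illustrative thresholds)."""
    return [
        d for d in docs
        if d.toxicity <= max_toxicity and d.quality in allowed_quality
    ]


# Toy data with made-up attribute values, for illustration only.
docs = [
    Doc("a clear explainer", toxicity=0.05, quality="high", domain="science"),
    Doc("spam keywords", toxicity=0.10, quality="low", domain="shopping"),
    Doc("an abusive rant", toxicity=0.90, quality="medium", domain="forums"),
]

kept = refine(docs)
print([d.text for d in kept])
```

In practice the thresholds and the choice of attributes would need to be validated with downstream ablations rather than fixed by hand as they are here.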

Taxonomy

Topics: Big Data Technologies and Applications · Data Quality and Management

Methods: Sparse Evolutionary Training