Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Bo Liu, Aastha Jhunjhunwala, Zhilin Wang, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

TL;DR
This paper systematically analyzes the construction of large pretraining datasets for language models, identifying effective methods and proposing improvements based on data attributes to enhance model performance.
Contribution
It provides the first comprehensive study of pretraining dataset construction, including ablations, data source categorization, and attribute-based refinement strategies.
Findings
Certain data collection techniques significantly improve downstream accuracy.
Web crawl data varies widely in toxicity, quality, and domain, affecting model training.
Attribute-based data filtering enhances pretraining set quality.
Abstract
The impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These…
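As a rough illustration of the attribute-based refinement the abstract describes, the sketch below filters documents on per-document attribute annotations. It is a minimal sketch under stated assumptions, not the paper's pipeline: the `Document` fields, the `filter_by_attributes` helper, the 0.5 toxicity cutoff, and the quality labels are all hypothetical, and in practice the attribute values would come from classifiers run over the web crawl.

```python
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    toxicity: float  # hypothetical classifier score in [0, 1]; higher means more toxic
    quality: str     # hypothetical label: "low", "medium", or "high"
    domain: str      # hypothetical domain label, e.g. "news"


def filter_by_attributes(docs, max_toxicity=0.5, allowed_quality=("medium", "high")):
    """Keep documents whose toxicity is at or below the cutoff and whose
    quality label is in the allowed set. Both thresholds are assumptions."""
    return [
        d for d in docs
        if d.toxicity <= max_toxicity and d.quality in allowed_quality
    ]


# Toy annotated corpus (made-up values for illustration only).
docs = [
    Document("a careful explainer", toxicity=0.05, quality="high", domain="science"),
    Document("spam listing", toxicity=0.10, quality="low", domain="shopping"),
    Document("abusive rant", toxicity=0.90, quality="medium", domain="forums"),
    Document("a decent news brief", toxicity=0.20, quality="medium", domain="news"),
]

kept = filter_by_attributes(docs)
kept_texts = [d.text for d in kept]
```

Here `kept_texts` holds only the first and last documents; the other two are dropped for low quality and high toxicity respectively. The same annotations could instead drive sampling weights rather than a hard filter, which is a design choice this sketch does not attempt to settle.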
Taxonomy
Topics: Big Data Technologies and Applications · Data Quality and Management
Methods: Sparse Evolutionary Training
