Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini

TL;DR
This paper introduces WebOrganizer, a framework for organizing web data into domains by topic and format, improving pre-training data curation and model performance through domain-aware data mixing.
Contribution
The paper presents a novel method to automatically annotate and organize web data into domains, enhancing data curation and downstream model performance.
Findings
Domain mixing improves model performance on downstream tasks.
Organizing data by topic and format enhances data curation.
Combining domain insights with quality-based methods boosts effectiveness.
Abstract
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate…
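To make the idea of domain-aware mixing concrete, here is a minimal sketch of sampling documents according to weights over (topic, format) domains. It is not the authors' implementation: the function names, the `topic` and `format` document fields, and the choice to combine the two sets of weights by simple multiplication and renormalization are all assumptions made for illustration.

```python
import random


def mixture_weights(topic_weights, format_weights):
    """Combine per-topic and per-format weights into joint domain weights.

    Assumption for illustration: the joint weight of a (topic, format)
    domain is the product of the two marginal weights, renormalized.
    """
    joint = {
        (topic, fmt): tw * fw
        for topic, tw in topic_weights.items()
        for fmt, fw in format_weights.items()
    }
    total = sum(joint.values())
    return {domain: weight / total for domain, weight in joint.items()}


def sample_documents(docs, weights, n, seed=0):
    """Draw n documents (with replacement) following the domain weights.

    Each doc is a dict with hypothetical 'topic' and 'format' keys, e.g.
    labels produced by a topic classifier and a format classifier.
    """
    rng = random.Random(seed)

    # Group documents by their (topic, format) domain.
    by_domain = {}
    for doc in docs:
        by_domain.setdefault((doc["topic"], doc["format"]), []).append(doc)

    # Only domains that are both present and positively weighted can be drawn.
    domains = [d for d in by_domain if weights.get(d, 0.0) > 0.0]
    probs = [weights[d] for d in domains]

    # First pick a domain per draw, then a document uniformly within it.
    picked = rng.choices(domains, weights=probs, k=n)
    return [rng.choice(by_domain[d]) for d in picked]
```

In this sketch, upweighting a topic or a format shifts the sampled training set toward that domain without discarding any domain that keeps a nonzero weight.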
Code & Models
- 🤗 WebOrganizer/TopicClassifier (model) · 37k dl · ♡ 17
- 🤗 WebOrganizer/TopicClassifier-NoURL (model) · 24k dl · ♡ 14
- 🤗 WebOrganizer/FormatClassifier-NoURL (model) · 22k dl · ♡ 7
- 🤗 WebOrganizer/FormatClassifier (model) · 26k dl · ♡ 9
- 🤗 WebOrganizer/LM-1b_1x-Baseline (model) · 44 dl
- 🤗 WebOrganizer/LM-1b_1x-Sampling_over_KMeans_for_MMLU (model) · 3 dl
- 🤗 WebOrganizer/LM-1b_1x-Sampling_over_Topics_for_MMLU (model) · 4 dl
- 🤗 WebOrganizer/LM-1b_1x-Sampling_over_Formats_for_MMLU (model) · 2 dl
- 🤗 WebOrganizer/LM-1b_1x-Sampling_over_Topics_x_Formats_for_MMLU (model) · 1 dl
- 🤗 WebOrganizer/LM-1b_1x-Sampling_over_KMeans_for_HellaSwag (model) · 1 dl
- WebOrganizer/TopicAnnotations-Llama-3.1-8B (dataset) · 64 dl
- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8 (dataset) · 48 dl
- WebOrganizer/FormatAnnotations-Llama-3.1-8B (dataset) · 51 dl
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8 (dataset) · 19 dl
- WebOrganizer/Corpus-200B (dataset) · 3.4k dl
Taxonomy
Topics: Research Data Management Practices · Data Quality and Management · Scientific Computing and Data Management
