Essential-Web v1.0: 24T tokens of organized web data
Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk

TL;DR
Essential-Web v1.0 is a 24-trillion-token web dataset in which every document carries twelve-category taxonomy labels covering topic, format, content complexity, and quality, so domain-specific training sets for language models can be curated with simple filters.
Contribution
The paper introduces Essential-Web v1.0, a massive web dataset with taxonomy annotations produced by a fine-tuned model, facilitating targeted data filtering and domain-specific dataset creation.
Findings
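The 0.5b-parameter annotator, EAI-Distill-0.5b, achieves annotator agreement within 3% of the much larger Qwen2.5-32B-Instruct.
SQL-style filters over the taxonomy labels yield competitive web-curated datasets: web code (+14.3% relative to SOTA), STEM (+24.5%), and medical (+8.6%), with math trailing SOTA (-8.0%).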
Provides a publicly available, organized web dataset for NLP research.
Abstract
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
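The "SQL-style filters" mentioned in the abstract can be illustrated with a minimal sketch. The table schema, column names (`topic`, `doc_format`, `complexity`, `quality`), label values, and sample rows below are all hypothetical stand-ins chosen for illustration; they are not the dataset's actual fields or the authors' filters.

```python
import sqlite3

# Hypothetical schema: column names and label values are illustrative
# assumptions, not the real Essential-Web v1.0 field names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, topic TEXT, "
    "doc_format TEXT, complexity INTEGER, quality INTEGER)"
)

# Toy documents with made-up taxonomy labels.
rows = [
    (1, "Proof that sqrt(2) is irrational ...", "math", "tutorial", 3, 4),
    (2, "Top 10 celebrity diets ...", "lifestyle", "listicle", 1, 1),
    (3, "Solving linear ODEs step by step ...", "math", "reference", 4, 3),
    (4, "Buy cheap calculators now ...", "math", "advertisement", 1, 1),
]
conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?, ?, ?)", rows)


def select_domain(conn, topic, min_quality):
    """Return ids of documents matching a topic and a minimum quality label."""
    cur = conn.execute(
        "SELECT id FROM docs "
        "WHERE topic = ? AND quality >= ? AND doc_format != 'advertisement' "
        "ORDER BY id",
        (topic, min_quality),
    )
    return [row[0] for row in cur.fetchall()]


math_ids = select_domain(conn, "math", 3)
print(math_ids)  # [1, 3]
```

The point of the sketch is that once every document carries labels, building a domain subset reduces to a predicate over label columns rather than a separate classification pipeline.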
Code & Models
- 🤗 mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v1-VIDEO · model · ♡ 1
- 🤗 EssentialAI/eai-distill-0.5b · model · 259k dl · ♡ 25
- 🤗 QuantFactory/eai-distill-0.5b-GGUF · model · 352 dl · ♡ 2
- 🤗 mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v1 · model · 755 dl · ♡ 1
- 🤗 pawara101/FoodExtract-gemma-3-270m-fine-tune-v1 · model · 239 dl
- 🤗 rtdoit/FoodExtract-gemma-3-270m-fine-tune-v1 · model
- 🤗 anquachdev/FoodExtract-gemma-3-270m-fine-tune-v1 · model · 9 dl
- EssentialAI/essential-web-v1.0 · dataset · 47k dl
- EssentialAI/eai-taxonomy-math-w-fm · dataset · 6.0k dl
- EssentialAI/eai-taxonomy-stem-w-dclm-100b-sample · dataset · 628 dl
- EssentialAI/eai-taxonomy-med-w-dclm-100b-sample · dataset · 55 dl
- EssentialAI/eai-taxonomy-code-w-dclm-100b-sample · dataset · 219 dl
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies · Advanced Database Systems and Queries
