Essential-Web v1.0: 24T tokens of organized web data

Essential AI: Andrew Hojel; Michael Pust; Tim Romanski; Yash Vanjani; Ritvik Kapila; Mohit Parmar; Adarsh Chaluvaraju; Alok Tripathy; Anil Thomas; Ashish Tanwer; Darsh J Shah; Ishaan Shah; Karl Stratos; Khoi Nguyen; Kurt Smith; Michael Callahan; Peter Rushton; Philip Monk; Platon Mazarakis; Saad Jamal; Saurabh Srivastava; Somanshu Singla; Ashish Vaswani

arXiv:2506.14111 · cs.CL · June 23, 2025

Open Access · 1 Repo · 7 Models · 5 Datasets

TL;DR

Essential-Web v1.0 is a 24-trillion-token web dataset in which every document carries labels from a twelve-category taxonomy covering topic, format, content complexity, and quality, so domain-specific training sets can be curated with simple filters over those labels.

Contribution

The paper introduces Essential-Web v1.0, a 24-trillion-token web dataset whose taxonomy annotations are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model. The labels support targeted filtering and the creation of domain-specific datasets.

Findings

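01. EAI-Distill-0.5b reaches an annotator agreement within 3% of Qwen2.5-32B-Instruct, a much larger model.

02. SQL-style filters over the taxonomy labels yield competitive web-curated datasets: math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%), and medical (+8.6%).

03. The dataset is publicly available on HuggingFace.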

Abstract

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
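The "SQL-style filters" idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the column names (topic, format, quality), the label values, and the toy rows are hypothetical and are not the dataset's actual schema. The sketch uses an in-memory SQLite table as a stand-in for an index over the taxonomy labels.

```python
import sqlite3

# Hypothetical schema and toy rows. The column names and label values are
# illustrative assumptions, not the actual fields of Essential-Web v1.0.
ROWS = [
    (1, "mathematics", "tutorial", 4, "Proof that sqrt(2) is irrational"),
    (2, "mathematics", "forum_post", 1, "pls help with homework"),
    (3, "medicine", "reference", 5, "Overview of hypertension treatment"),
    (4, "software", "tutorial", 3, "def quicksort(xs): return sorted(xs)"),
]


def build_index(rows):
    """Load (id, topic, format, quality, text) rows into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "id INTEGER PRIMARY KEY, topic TEXT, format TEXT, "
        "quality INTEGER, text TEXT)"
    )
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def curate(conn, topic, min_quality):
    """Return ids of documents with the given topic and at least min_quality."""
    cur = conn.execute(
        "SELECT id FROM docs WHERE topic = ? AND quality >= ? ORDER BY id",
        (topic, min_quality),
    )
    return [row[0] for row in cur.fetchall()]


conn = build_index(ROWS)
print(curate(conn, "mathematics", 3))  # prints [1]
```

In this toy setup, curating a domain subset is a single WHERE clause over precomputed labels; no classifier has to be run at curation time, which is the property the abstract emphasizes.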

Code & Models

Repositories

essential-ai/eai-taxonomy (Official)

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Advanced Database Systems and Queries