The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Nikhil Kandpal; Brian Lester; Colin Raffel; Sebastian Majstorovic; Stella Biderman; Baber Abbasi; Luca Soldaini; Enrico Shippole; A. Feder Cooper; Aviya Skowron; John Kirchenbauer; Shayne Longpre; Lintang Sutawika; Alon Albalak; Zhenlin Xu; Guilherme Penedo; Loubna Ben Allal; Elie Bakouch; John David Pressman; Honglu Fan; Dashiell Stander; Guangyu Song; Aaron Gokaslan; Tom Goldstein; Brian R. Bartoldson; Bhavya Kailkhura; Tyler Murray

arXiv:2506.05209 · cs.CL · June 6, 2025

TL;DR

This paper introduces the Common Pile v0.1, an 8TB openly licensed text dataset for training large language models, demonstrating its effectiveness by training competitive 7-billion-parameter models.

Contribution

The authors curated and released a large, high-quality openly licensed dataset and validated its utility by training competitive LLMs, addressing data size and quality limitations of prior efforts.

Findings

01. The two 7-billion-parameter models trained on the Common Pile, Comma v0.1-1T and Comma v0.1-2T, achieve performance comparable to models trained on unlicensed text.

02. The dataset draws on 30 sources spanning research papers, code, books, encyclopedias, educational materials, and audio transcripts.

03. The authors release both the dataset and training code for community use.

Abstract

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1…

Videos

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text · SlidesLive

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Artificial Intelligence in Healthcare and Education

Methods: LLaMA