Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzystof Jones, Ir\`ene Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov

TL;DR
Common Corpus is the largest open, ethically sourced dataset for pre-training large language models, covering diverse languages, domains, and including code, facilitating open science and compliant AI development.
Contribution
Introduces the largest open, ethically curated dataset for LLM pre-training, with detailed provenance, filtering, and successful multilingual model training results.
Findings
Models trained on Common Corpus perform comparably to other models of similar size.
The dataset covers a wide variety of languages and domains.
Common Corpus enables research and entrepreneurial applications in AI.
Abstract
Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
