Datasheet for the Pile
Stella Biderman, Kieran Bicheno, Leo Gao

TL;DR
This datasheet documents the Pile, an 825 GiB dataset of human-authored text drawn from 22 diverse sources and designed for training large-scale language models. The detailed documentation is intended to support transparency and reproducibility.
Contribution
It presents a comprehensive dataset with detailed datasheet documentation, enabling better understanding and use in large-scale language model training.
Findings
The dataset contains 22 diverse text sources.
The dataset size is 825 GiB.
The datasheet provides transparency through detailed documentation of the dataset.
Abstract
This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.
Code & Models
- 🤗 EleutherAI/gpt-neox-20b · model · 264k dl · ♡ 580
- 🤗 CarperAI/diff-codegen-6b-v2 · model · 46 dl · ♡ 40
- 🤗 EleutherAI/pythia-14m-deduped · model · 20k dl · ♡ 29
- 🤗 EleutherAI/pythia-14m · model · 95k dl · ♡ 1
- 🤗 ataeff/pythia-1b · model · ♡ 1
- 🤗 CarperAI/FIM-NeoX-1.3B · model · 37 dl · ♡ 26
- 🤗 EleutherAI/pythia-160m-v0 · model · 1.1k dl · ♡ 9
- 🤗 EleutherAI/pythia-1.4b-v0 · model · 866 dl · ♡ 7
- 🤗 EleutherAI/pythia-1b-v0 · model · 1.1k dl · ♡ 6
- 🤗 EleutherAI/pythia-70m-v0 · model · 1.1k dl · ♡ 7
- JeanKaddour/minipile · dataset · 10k dl
- EleutherAI/pile · dataset · 1.8k dl
- andstor/the_pile_github · dataset · 537 dl
- JonasGeiping/the_pile_WordPiecex32768_97b8e776baafb99c3892e6572a9f51b3 · dataset · 287 dl
- JonasGeiping/the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020 · dataset · 667 dl
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
