The Stack: 3 TB of permissively licensed source code
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries

TL;DR
The paper introduces The Stack, a 3.1 TB dataset of permissively licensed source code in 30 programming languages, intended to support open and responsible research on large language models for code.
Contribution
It describes how the full dataset was collected, how the permissively licensed subset was constructed, and the accompanying data governance plan, and it reports text2code results from 350M-parameter decoders trained on different Python subsets.
Findings
Near-deduplicating the training data significantly boosts performance across all experiments.
Models trained only on permissively licensed data can match previously reported HumanEval and MBPP performance.
The dataset is publicly available for research and development.
Abstract
Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at…
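The abstract does not spell out how near-deduplication is performed, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes exact Jaccard similarity over token 3-grams and a greedy keep-first policy; the function names, the shingle size `k=3`, and the `0.7` threshold are all illustrative assumptions. Pipelines at the scale of terabytes typically approximate this comparison with MinHash and locality-sensitive hashing rather than comparing every pair of files.

```python
import re


def shingles(text, k=3):
    """Return the set of k-token shingles of a source file (hypothetical helper)."""
    # Split into identifier/number tokens and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_deduplicate(files, threshold=0.7, k=3):
    """Greedily keep a file only if it is not too similar to any kept file.

    Returns the indices of the files that survive. The threshold and k are
    illustrative assumptions, not values taken from the paper.
    This is O(n^2) in the number of files and only suitable for small inputs.
    """
    kept = []  # (index, shingle set) for every file retained so far
    for idx, text in enumerate(files):
        current = shingles(text, k)
        if all(jaccard(current, other) < threshold for _, other in kept):
            kept.append((idx, current))
    return [idx for idx, _ in kept]


if __name__ == "__main__":
    corpus = [
        "def add(a, b):\n    return a + b\n",
        "def add(a, b):\n    return a + b  # sum\n",  # near-duplicate of file 0
        "print('hello world')\n",
    ]
    print(near_deduplicate(corpus))  # [0, 2]
```

In this toy corpus the second file differs from the first only by a trailing comment, so its shingle set overlaps heavily with the first (Jaccard similarity 10/12) and it is dropped, while the unrelated third file is kept.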
Taxonomy
Topics: Topic Modeling · Software Engineering Research
