The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major

TL;DR
This paper presents the creation of the ROOTS corpus, a 1.6TB multilingual dataset used to train the BLOOM language model, emphasizing ethical data sourcing and open science.
Contribution
It introduces a large, ethically curated multilingual dataset for training massive language models, with open access and analysis tools.
Findings
The ROOTS corpus covers 59 languages.
The dataset was used to train the 176-billion-parameter BLOOM model.
A large initial subset of the corpus, along with analyses of it, is released for further research.
Abstract
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
