The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Hugo Laurençon; Lucile Saulnier; Thomas Wang; Christopher Akiki; Albert Villanova del Moral; Teven Le Scao; Leandro Von Werra; Chenghao Mou; Eduardo González Ponferrada; Huu Nguyen; Jörg Frohberg; Mario Šaško; Quentin Lhoest; Angelina McMillan-Major; Gerard Dupont; Stella Biderman; Anna Rogers; Loubna Ben Allal; Francesco De Toni; Giada Pistilli; Olivier Nguyen; Somaieh Nikpoor; Maraim Masoud; Pierre Colombo; Javier de la Rosa; Paulo Villegas; Tristan Thrush; Shayne Longpre; Sebastian Nagel; Leon Weber; Manuel Muñoz; Jian Zhu; Daniel Van Strien; Zaid Alyafeai; Khalid Almubarak; Minh Chien Vu; Itziar Gonzalez-Dios; Aitor Soroa; Kyle Lo; Manan Dey; Pedro Ortiz Suarez; Aaron Gokaslan; Shamik Bose; David Adelani; Long Phan; Hieu Tran; Ian Yu; Suhas Pai; Jenny Chim; Violette Lepercq; Suzana Ilic; Margaret Mitchell; Sasha Alexandra Luccioni; Yacine Jernite

arXiv:2303.03915 · cs.CL · March 8, 2023 · 65 citations

TL;DR

This paper presents the creation of the ROOTS corpus, a 1.6TB multilingual dataset used to train the BLOOM language model, emphasizing ethical data sourcing and open science.

Contribution

The paper introduces ROOTS, a large, ethically curated multilingual dataset for training large language models, released with open access and analysis tools.

Findings

01. The ROOTS corpus covers 59 languages.
02. The dataset was used to train the 176-billion-parameter BLOOM model.
03. Open data and tools are released for further research.

Abstract

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling…

Code & Models

Models

🤗 TurkuNLP/bloom-finnish-176b · model · 9 downloads · 6 likes

Videos

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies