The ROOTS Search Tool: Data Transparency for LLMs

Aleksandra Piktus; Christopher Akiki; Paulo Villegas; Hugo Laurençon; Gérard Dupont; Alexandra Sasha Luccioni; Yacine Jernite; Anna Rogers

arXiv:2302.14035·cs.CL·February 28, 2023

TL;DR

The paper introduces the ROOTS Search Tool, an open-source search engine for the extensive ROOTS multilingual corpus, enhancing data transparency and governance for large language model training.

Contribution

It presents the development and implementation of a comprehensive search tool for the ROOTS corpus, enabling detailed investigation and transparency of data used in LLM training.

Findings

01

Largest searchable corpus with fuzzy and exact search capabilities

02

Open-sourced tool available on Hugging Face Spaces

03

Facilitates data transparency and governance in LLM training

Abstract

ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces. We describe our implementation and the possible use cases of our tool.

Tables

Table 1: Each row represents a single BM25 index we build.

ROOTS language tag	# documents	Data size (GB)	# snippets	Index size (GB)	Analyzer
zh, zhs, zht	88,814,841	259.01	111,284,681	682	zh
indic	84,982,982	70.45	100,810,124	714.08	whitespace
en	77,010,827	470.47	695,521,432	766.14	en
es	67,005,817	172.40	267,542,136	264.35	es
fr	58,847,091	204.03	299,938,546	305.29	fr
vi	34,110,375	42.83	76,164,552	72.89	whitespace
pt	31,969,891	77.59	122,221,863	119.98	pt
code	26,176,998	173.16	365,424,222	206.96	whitespace
ar	15,234,080	73.75	68,509,441	93.71	ar
id	12,514,253	19.63	29,531,873	27.16	id
ca	6,142,390	17.42	26,844,600	29.65	es
eu	5,149,797	2.36	6,219,039	4.56	whitespace
nigercongo	1,162,568	0.48	1,462,238	0.89	whitespace
total	597,936,751	1583.59	2,171,474,747	2518.99

Code & Models

Repositories

huggingface/roots-search-tool (Official)

Datasets

society-ethics/papers
dataset · 44 dl

Taxonomy

Topics: Data Quality and Management · Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: BLOOM

Full text

The ROOTS Search Tool: Data Transparency for LLMs

Aleksandra Piktus1,2 Christopher Akiki3,4 Paulo Villegas5 Hugo Laurençon1

Gérard Dupont6 Alexandra Sasha Luccioni1 Yacine Jernite1 Anna Rogers7

1Hugging Face 2Sapienza University 3Leipzig University 4ScaDS.AI

5Telefonica I+D 6Mavenoid 7University of Copenhagen

1 Introduction

Large language models (LLMs) are ubiquitous in modern NLP, used directly to generate text and as building blocks in downstream applications. The ever-increasing size of the latest models inflates the demand for massive volumes of training data Hoffmann et al. (2022), in practice sourced mainly from the Web. This raises questions concerning the quality of the data, the feasibility of curating and inspecting it, as well as documenting it in terms of what kinds of speech and speakers it represents Jo and Gebru (2020); Bender et al. (2021); Akiki et al. (2022). Without that level of characterization, we cannot tell for what varieties of language the resulting models can be expected to work well, whether the data was ethically sourced, how to interpret evaluation metrics, and to what degree a particular output was memorized directly from the training data. In an encouraging new trend, we see researchers exploring ways to quantitatively describe large datasets Mitchell et al. (2022). However, user-friendly tools for an extensive qualitative analysis are still predominantly missing. In our current work, we aim to fill that gap for a specific, web-scale, textual corpus.

Building on the efforts of the BigScience workshop (bigscience.huggingface.co), we present the ROOTS Search Tool (hf.co/spaces/bigscience-data/roots-search), a search engine for the 1.6TB multilingual ROOTS corpus Laurençon et al. (2022). The ROOTS corpus was created to pre-train BLOOM Scao et al. (2022), the first LLM of its scale designed with commensurate efforts in responsible licensing (bigscience.huggingface.co/blog/the-bigscience-rail-license) and data governance Jernite et al. (2022). We hope that our tool will facilitate qualitative analysis of the web-scale ROOTS corpus, and establish the qualitative analysis of training data, which is critical for model understanding and governance work, as an essential step in the development of LLMs.

2 Related Work

Corpus linguistics.

The core methodology for studying large volumes of text was developed in corpus linguistics McEnery and Hardie (2013), an area of research responsible for curating large text collections carefully designed to represent specific varieties of language. For example, the 100M word British National Corpus Leech (1992) was created to represent the spoken and written British English of the late 20th century, with each text handpicked by experts, who also procured appropriate copyright exemptions. Similar national corpora were later created for many other languages, e.g. Japanese Maekawa (2008). The texts were often accompanied by multiple layers of annotations—syntactic, morphological, semantic, genre, source etc. This enabled valuable empirical research on the variants of represented languages, finding use in early distributional semantic models. Corpus linguistics developed sophisticated methodologies including concordances, word sketches and various word association measures (Stefanowitsch and Gries, 2003; Baker, 2004; Kilgarriff, 2014, among others). However, this methodology did not adapt well to Web-scale corpora due to the lack of tools and resources that could support such scale.

Web-scale corpora for LLM pre-training.

As LLMs grew, so did the need for massive pre-training datasets. To date, there have been several efforts to collect and clean large English and multilingual corpora Raffel et al. (2020); Xue et al. (2021); Gao et al. (2020); Ortiz Suárez et al. (2020); Bañón et al. (2020); El-Kishky et al. (2020). Non-English, monolingual corpora of this scale have also started to emerge Gutiérrez-Fandiño et al. (2022); Kummervold et al. (2022). However, the sheer scale of such datasets renders them hard to properly curate: we now know that the data used for training LLMs may contain synthetic data Dodge et al. (2021), privacy-infringing data Carlini et al. (2020); Huang et al. (2022), incorrect language codes or translations Kreutzer et al. (2022), not to mention the ubiquitous issues with social biases (Blodgett et al., 2020; Field et al., 2021; Stanczak and Augenstein, 2021, among others). Another issue pertains to the permissions to use the data, which, perhaps most famously, surfaced in relation to the BookCorpus Zhu et al. (2015), used, among others, to train BERT Devlin et al. (2019), but collected without author permissions and eventually taken down by the authors Bandy and Vincent (2021).

These issues are a consequence of the fact that the current web-scale corpora are opportunistic samples of publicly available text, rather than artifacts curated to provide a representative snapshot of a specific language variety, as in the corpus linguistics work Rogers (2021). This highlights the general problem with the lack of documentation in NLP datasets of all sizes Bender and Friedman (2018); Gebru et al. (2020), and the fact that data work has generally not been a priority in NLP recently Sambasivan et al. (2021).

Information Retrieval for massive text corpora.

Inspecting large data collections is a central topic of study in another Machine Learning domain, namely Information Retrieval. Even though multiple techniques for analysing large document collections have been developed over the years, there has been little interest so far in applying them specifically to study LLM training data. The closest to our work is the C4 Raffel et al. (2020) Search (https://c4-search.apps.allenai.org/); however, the tool comes with no documentation to explain the details of the indexed variant of the dataset or applied design choices. Similar tools have emerged for smaller, more specialised corpora, e.g. COVID-related datasets (Zhang et al., 2020), news quotes Vuković et al. (2022) and medical literature (Niezni et al., 2022). Razeghi et al. (2022) provide an interface to pre-computed term frequencies from the Pile, but it does not provide full-text corpus search. In the Computer Vision community, related efforts (https://haveibeentrained.com/) target large text and image datasets such as LAION Schuhmann et al. (2022, 2021).

We believe our work to be the first principled effort in providing search access to the training corpus of an existing large language model and the largest text dataset search tool currently available.

3 The ROOTS corpus

The ROOTS corpus Laurençon et al. (2022) is a high-quality, heterogeneous, multilingual text corpus collected as part of the BigScience project to train the BLOOM LLM Scao et al. (2022). ROOTS consists of 1.6TB of data in 46 natural and 13 programming languages. The full ROOTS dataset is open to the members of the BigScience Data organization on the Hugging Face hub, which interested researchers can still apply to join through a sign-up link (footnote 6).

3.1 Data Governance

The development of the BLOOM model within the BigScience project was backed by significant work on data governance, as it was identified early on as one of the highest-impact levers of action to enable better accountability and data subject agency in modern ML technology (see "Data governance and representation in BigScience"). Participants started by designing a new governance framework to meet the unique needs of distributed data governance for web-scale data in terms of respecting data subject rights Jernite et al. (2022). A partial implementation of this framework was used for the ROOTS data as described by Laurençon et al. (2022), focusing on explicit agreements with data custodians, extensive documentation of the data sources, technical tools for privacy-enhancing data handling, and purpose-specific access to subsets of the data.

The present tool goes one step further in implementing the proposed data governance framework by enabling examination of, and feedback on, the data sources by any interested parties, while still maintaining the controlled access necessary to the proposed governance. The tool only provides 128-word snippets of indexed documents, akin to regular web search engines, and hence provides no practical way to reconstruct the full corpus. The snippets are traceable to their origin in the full ROOTS corpus, and we additionally link to original source documents whenever possible. (Unfortunately, the metadata in ROOTS is inconsistent and we only have access to URLs in the pseudocrawl datasets.) Finally, users of the tool are able to flag specific search results with an explanation to outline possible infringements of data subjects’ privacy or intellectual property rights. At this stage, the information collected from the flagging process is primarily intended to serve as a basis for future research on collaborative data governance processes. We provide more examples of use cases to support data examination and governance in Section 5.

3.2 Data Pre-processing

Documents vs snippets.

ROOTS consists of documents of varying lengths, with outliers as long as 282,571 words. For fuzzy search, we split documents into short snippets of at most 128 words and index snippets rather than the original documents. This helps us follow the controlled access principle discussed in the previous section and makes indexed snippets more comparable in the context of fuzzy search. In exact search, we look for the exact occurrences of the input query within documents and construct snippets ad hoc, including words on both sides of the detected occurrence.
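The snippet construction for fuzzy search can be illustrated with a minimal sketch. It assumes that "words" are whitespace-delimited tokens and that snippets are consecutive and non-overlapping; the function name `split_into_snippets` is hypothetical, and the actual pre-processing may segment text differently (for instance for languages written without spaces).

```python
from typing import List

MAX_SNIPPET_WORDS = 128  # snippet length stated in the text


def split_into_snippets(document: str, max_words: int = MAX_SNIPPET_WORDS) -> List[str]:
    """Split a document into consecutive snippets of at most `max_words` words.

    Assumption: words are whitespace-delimited tokens.
    """
    words = document.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# A 300-word toy document yields two full snippets and one shorter remainder.
doc = " ".join(f"w{i}" for i in range(300))
print([len(s.split()) for s in split_into_snippets(doc)])  # prints [128, 128, 44]
```

Indexing fixed-size snippets rather than whole documents keeps every indexed unit within the same length bound, which is what makes relevance scores more comparable across results.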

Unique Result IDs.

In order to be able to trace search results back to their source, we construct result IDs, adopting the following convention: we include (a) the dataset name as defined on the Hugging Face Hub, followed by (b) the ID of the document from which the given snippet came, and (c) a question mark. We then include parameters which differ depending on the search strategy used. In fuzzy search we introduce two parameters: the seg parameter describing the segmentation strategy applied during the pre-processing stage, and the seg_id parameter indicating the rank of the given snippet under the specified segmentation strategy. For exact search, we include a single id parameter indicating the rank of the occurrence of the query in the current document.
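As a rough illustration of this convention, the sketch below builds and parses result IDs. Only the ordering (dataset name, document ID, question mark, parameters) and the parameter names seg, seg_id and id come from the text. The "/" separator between dataset name and document ID, the integer document IDs, the URL-style parameter encoding, and the example dataset name and segmentation label are all assumptions made for the example.

```python
from urllib.parse import urlencode, parse_qs


def fuzzy_result_id(dataset: str, doc_id: int, seg: str, seg_id: int) -> str:
    """ID for a fuzzy-search snippet: segmentation strategy plus snippet rank."""
    return f"{dataset}/{doc_id}?" + urlencode({"seg": seg, "seg_id": seg_id})


def exact_result_id(dataset: str, doc_id: int, occurrence: int) -> str:
    """ID for an exact-search hit: rank of the occurrence within the document."""
    return f"{dataset}/{doc_id}?" + urlencode({"id": occurrence})


def parse_result_id(result_id: str):
    """Split a result ID back into (dataset, document ID, parameters)."""
    path, _, query = result_id.partition("?")
    dataset, _, doc_id = path.rpartition("/")  # assumed "/" separator
    params = {key: values[0] for key, values in parse_qs(query).items()}
    return dataset, int(doc_id), params


# Hypothetical dataset name and segmentation label, for illustration only.
rid = fuzzy_result_id("example-org/example_dataset", 42, "128-words", 3)
print(rid)
print(parse_result_id(rid))
```

Whatever the concrete encoding, the point of the scheme is that an ID alone is enough to locate the source document and the position of the snippet or match inside it.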

PII redaction.

During preliminary experiments on the ROOTS corpus, OSCAR Ortiz Suárez et al. (2019) has been identified as a source of a large number of documents containing personally identifiable information (PII). A regular-expression-based PII redaction script (the BigScience PII redaction script) has been applied to OSCAR prior to BLOOM training. However, the dataset itself still contains unredacted text. In order to avoid leaking PII through our search tool, we apply an improved variant of the BigScience PII redaction script on the backend side and display results with PII redacted in a visible way. This way one can inspect the data and observe the problem, but personal information is predominantly removed. An example is shown in Figure 2.
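To make the idea of visible, regular-expression-based redaction concrete, here is a minimal sketch. It is not the BigScience script: the two patterns (email addresses and IPv4-like addresses), the placeholder format and the function name `redact_pii` are assumptions chosen for illustration, and a real redaction pipeline covers many more PII types and edge cases.

```python
import re

# Illustrative patterns only; these are assumptions, not the BigScience rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each matched span with a visible placeholder naming the PII type."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact_pii("Contact jane.doe@example.com from 192.168.0.1 today."))
# prints: Contact <EMAIL> from <IP_ADDRESS> today.
```

Using a typed placeholder instead of deleting the span keeps the redaction visible, so a reader can still see that the source document contained PII at that position.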

4 Implementation

Fuzzy Search Backend.

The ROOTS corpus is organized in 498 datasets, each annotated with a language identifier. There are two types of identifiers: those indicating an individual language (e.g. pt for Portuguese), and those indicating a language within a language group (e.g. indic-mr for Marathi, as part of the Indic language group). All programming languages are collected under a common code tag. We build 13 sparse BM25 Robertson (2009) indices: one per language group for the indic and nigercongo groups, one for code, and one for each of the remaining languages (except Chinese, where we combine the tags zh, zht, and zhs into a single index). Table 1 presents the basic information per index. We index respective subsets of the corpus using Pyserini Lin et al. (2021), a leading toolkit for reproducible IR research. Tokenization is performed with native Lucene (https://lucene.apache.org/) analyzers available via the Pyserini API (see Table 1 to check which analyzers were used for specific indices).
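The mapping from dataset language tags to the 13 indices can be sketched as follows. The index names and analyzers are copied from Table 1. The function name `index_for_tag` is hypothetical, and the sketch assumes that nigercongo tags follow the same "group-language" pattern as the indic-mr example (the tag nigercongo-sw below is made up for illustration).

```python
# Analyzer used for each index, copied from Table 1.
INDEX_ANALYZERS = {
    "zh": "zh", "indic": "whitespace", "en": "en", "es": "es", "fr": "fr",
    "vi": "whitespace", "pt": "pt", "code": "whitespace", "ar": "ar",
    "id": "id", "ca": "es", "eu": "whitespace", "nigercongo": "whitespace",
}

GROUP_PREFIXES = ("indic", "nigercongo")  # language groups sharing one index
CHINESE_TAGS = {"zh", "zhs", "zht"}       # combined into a single index


def index_for_tag(tag: str) -> str:
    """Return the name of the index that holds datasets with this language tag."""
    if tag in CHINESE_TAGS:
        return "zh"
    group = tag.split("-", 1)[0]
    if group in GROUP_PREFIXES:
        return group
    if tag in INDEX_ANALYZERS:
        return tag
    raise KeyError(f"no index for language tag {tag!r}")


for tag in ("pt", "indic-mr", "zht", "nigercongo-sw", "code"):
    index = index_for_tag(tag)
    print(tag, "->", index, "with analyzer", INDEX_ANALYZERS[index])
```

Note that Catalan (ca) is indexed with the Spanish analyzer according to Table 1, while indices without a dedicated language analyzer use whitespace tokenization.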

Exact Search Backend.

We leverage a suffix array implementation (https://github.com/google-research/deduplicate-text-datasets) proposed by Lee et al. (2022). We build the suffix array for the whole ROOTS corpus, this time without the split into languages or language groups. We host both the BM25 indices and the suffix array on Hugging Face-provisioned machines. The server code is open-sourced (https://github.com/huggingface/roots-search-tool).
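For readers unfamiliar with suffix arrays, the following toy sketch shows how one supports exact substring lookup. It is not the implementation by Lee et al. (2022), which is engineered for corpora far too large for this approach: the naive construction below sorts whole suffixes in memory and is meant only to illustrate the lookup principle. All function names are hypothetical.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort all suffix start offsets lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: List[int], query: str) -> List[int]:
    """Return the sorted start offsets of every exact occurrence of `query`."""
    if not query:
        return []
    n, m = len(suffix_array), len(query)
    # Lower bound: first suffix whose m-character prefix is >= query.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > query.
    hi = n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


text = "to be or not to be"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "to be"))  # prints [0, 13]
print(find_occurrences(text, sa, "To be"))  # prints [] (matching is case-sensitive)
```

Because all suffixes starting with the query form one contiguous block in the sorted array, two binary searches give both the matches and their total count, which is why a single array over the whole corpus can answer exact queries irrespective of language.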

Frontend and User Experience.

The ROOTS Search Tool user interface is built with Gradio Abid et al. (2019) and served via Hugging Face Spaces (https://huggingface.co/docs/hub/spaces). By default, searches are performed in fuzzy mode; to switch to exact search, one can enclose the query in double quotes. Fuzzy searches can be performed in a user-specified language, or in all languages (in that case results are surfaced separately for each language). We also provide an option to auto-detect the language of the query with a FastText Joulin et al. (2017) classifier. Results are displayed in the order of decreasing relevance; users can control the maximum number of results they want to see using a slider. In exact search mode, the backend returns all documents matching a given query exactly, irrespective of the language, and they are displayed over multiple pages in a random order, with the max results parameter controlling the size of a single page. The total number of matched results is displayed at the top of the results page. PII redaction is applied to all results on the backend side. The tool also allows users to filter out all results from a specific dataset appearing on a given page.
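The query handling described above can be sketched in a few lines. The functions `parse_query` and `paginate` are hypothetical and do not reproduce the tool's frontend code; they only illustrate the rule that double quotes select exact search and that the max results setting bounds the size of one results page.

```python
from typing import List, Sequence, Tuple


def parse_query(raw: str) -> Tuple[str, str]:
    """Return (mode, query): 'exact' if wrapped in double quotes, else 'fuzzy'."""
    stripped = raw.strip()
    if len(stripped) >= 2 and stripped.startswith('"') and stripped.endswith('"'):
        return "exact", stripped[1:-1]
    return "fuzzy", stripped


def paginate(results: Sequence[str], max_results: int, page: int) -> List[str]:
    """Return one zero-indexed page containing at most `max_results` results."""
    start = page * max_results
    return list(results[start:start + max_results])


print(parse_query('"Two Muslims"'))  # prints ('exact', 'Two Muslims')
print(parse_query("Two Muslims"))    # prints ('fuzzy', 'Two Muslims')
print(paginate(["r0", "r1", "r2", "r3", "r4"], max_results=2, page=2))  # prints ['r4']
```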

5 Use cases

Detecting PII issues to improve obfuscation.

BLOOM was trained with efforts to detect and obfuscate PII in the original ROOTS documents, and as described in subsection 3.2, we build on that effort when obfuscating PII in search results. However, it is still possible that some such data was not detected. The tool allows concerned individuals to search for specific PII, which is the first step toward requesting removal of their data. They could also simply search for their name to see whether, and how, they are represented in the corpus.

Detecting problematic content.

Text from Web crawls is not necessarily high-quality human-written text. Among the possible problems are hate speech, excessive pornography, synthetic text (e.g. machine-translated text, AI-generated text), word lists that are not meaningful and are meant to trick search engines (Hamilton, 2013), and factually incorrect text such as fake news or conspiracy theories. For example, we found at least 5 snippets from the OSCAR source incorrectly arguing that Barack Obama was born in Kenya. While the creators of ROOTS employed filtering strategies targeted specifically at spam and machine-generated content Laurençon et al. (2022), developing filters for such content is a never-ending arms race with its producers, and the only way to keep improving them is to look at the data, which our tool enables.

Studying representation of dialects and social groups.

When LLM-based systems are deployed, the implicit assumption is often that they are general-purpose and can serve all of their potential users equally well. But there is no such thing as a “neutral”, one-size-fits-all corpus Rogers (2021). An obvious issue is dialects, and in the case of multilingual models like BLOOM another obvious problem is language imbalance. Besides that, the training data may not equally represent the topics and sources associated with different demographic groups, and hence the LLM would likely not cater to them equally well. Bender et al. (2021) cite the example of GPT-2: the filter for its sources was that they were shared on Reddit, which overrepresents the interests of the typical Reddit user (of whom in the US 67% are men, and 64% are 18-29 years old).

Such training data is then likely to reinforce social stereotypes harmful to marginalized populations. For example, GPT-3 has been shown to over-associate Muslims with violence Abid et al. (2021). In particular, prompting the model to continue “Two Muslims walked into…” tends to lead to mentions of terrorism or assault. BLOOM is not free from these biases: we sampled 10 completions and found 4 that mentioned guns or death (compared to 66% reported for GPT-3). Exact search for “Two Muslims Walked into…” returned examples of papers studying this very phenomenon, but a search for just “Two Muslims” shows that many passages in OSCAR mention violence or terrorism, whereas mentions in Semantic Scholar, pseudo-crawled websites, and Wikipedia are more varied.

Detecting the presence of specific information.

Where the suitability of a model to a given application depends on it being up-to-date with the latest events, or knowledge about a given fact, a tool like ours can help to quickly find out if the model even theoretically could “learn” a given fact. For instance, ROOTS contains 231 references to the death of Queen Elizabeth, but they refer to the death of Elizabeth I in 1603 and not to the recent passing of Elizabeth II in 2022.

Detecting plagiarism/memorization.

Generative LLMs can memorize part of their training sets and repeat it verbatim in their outputs. We can probe an LLM to elicit candidates for data memorization Carlini et al. (2020), and the ROOTS Search Tool can help in different ways:

• By conditioning model probing on actual training data, so that we can more easily check whether such data has been memorized;

• By providing the ground truth to verify that the model output was part of the training data;

• By providing the ground truth to verify that the model did have a chance to memorize something that it should have memorized;

• By providing match counts to identify which data was more likely to be memorized (since the number of copies in the training data influences memorization Kandpal et al. (2022)).

For example, BLOOM correctly completes Prince Hamlet’s To be or not to be soliloquy—both using greedy decoding and nucleus sampling—but not the less popular Shakespeare quote I am in this earthly world, where to do harm… is often laudable, to do good sometime accounted dangerous folly. With our tool we verified that BLOOM had access to at least 7 sources for the Macbeth quote (vs at least 47 for Hamlet), but did not “learn” it.

Verifying originality.

An important question about generative AI models is to what extent their output – that is not a verbatim copy of training data – can be considered original. Consider the above quote from Macbeth, which BLOOM completed for us as follows: “I am in this earthly world, where to do harm… is to do good, and to do good is to do harm.” With our tool, we could easily verify that the suggested completion does not exist in the corpus verbatim. However, there are dozens of contexts where the concepts of “good” and “harm” are mentioned close to each other (esp. in the phrase “do more harm than good”), so they were the likely indirect sources for this completion. To what degree that completion can be considered new, original text is a key question for the current discussions on plagiarism in AI writing assistants and the legal status of their output.

Non-existing facts.

When the same associative mechanism generates factoid text, the model may “hallucinate” events that never occurred—or at least, there was no evidence on which the model could draw. This, too, becomes easy to verify with our tool. BLOOM completed the prompt “When was the Golden Gate Bridge transported for the second time across Egypt?” Hofstadter (2022) with “The first time was in the late 19th century, when the bridge was transported from San Francisco to Cairo”. Of course, this “fact” is untrue, and was not mentioned in the corpus. But we could not even find mentions of anything else transported from San Francisco to Cairo. How exactly LLMs come up with such generations is an interesting research problem, for which tools like ours could be useful.

Enabling data removal requests.

The authors of texts that were included in web crawls could use such a tool to identify that fact and request the removal of their texts. For ROOTS, the data governance structure set up for the BigScience workshop operated only for its duration, but should there be any future work relying on the same data hosts and agreements, the flagged data collected through our tool can be used to honor the removal requests.

Benchmark data contamination.

To interpret benchmark results, we need to know whether they reflect training data memorization or generalization. One approach is for the model authors to specifically plan for the evaluation benchmarks prior to training, and try to exclude the benchmark data Brown et al. (2020), but this limits the options for external evaluation. Our tool enables sampled checks of benchmark data, and was already successfully used to find (https://twitter.com/WilliamBarrHeld/status/1586090252946448384) that BLOOM should not be evaluated on XNLI Conneau et al. (2018).
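A sampled contamination check of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the procedure used for XNLI: the tool is described here as an interactive interface rather than an API, so the `contains` callable is a hypothetical stand-in for an exact-search lookup, and the toy corpus and benchmark strings are invented for the example.

```python
import random
from typing import Callable, Sequence


def contamination_rate(
    examples: Sequence[str],
    contains: Callable[[str], bool],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Estimate the share of benchmark examples that occur verbatim in a corpus.

    `contains` is a hypothetical stand-in for an exact-search lookup.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(examples), min(sample_size, len(examples)))
    if not sample:
        return 0.0
    hits = sum(1 for example in sample if contains(example))
    return hits / len(sample)


# Toy corpus and benchmark, invented for illustration.
corpus = "the cat sat on the mat. a dog barked."
benchmark = ["the cat sat on the mat", "a bird sang", "a dog barked", "fish swim"]
print(contamination_rate(benchmark, lambda text: text in corpus, sample_size=4))  # prints 0.5
```

A non-zero rate on even a small sample is a signal that scores on that benchmark may partly reflect memorization and deserve a closer look.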

Language contamination.

According to Laurençon et al. (2022), ROOTS contains data in 46 languages. But this is clearly not the full story. For example, neither Danish nor Ukrainian is listed, but we found examples in these languages (stackexchange, OSCAR, parsed academic pdf data). The tool can thus be useful for investigating the transfer to “unseen” languages in multilingual evaluation.

Word sense disambiguation.

Since the ROOTS Search Tool provides context paragraphs, it can be used to check in what sense a word was used in the training data. For example, the acronym LLM in ROOTS is used as “large language model” in the parsed academic article data, but in OSCAR it means predominantly “limited liability company” or “Legum Magister”. If future work extends our approach to providing search results through an API, then quantitative research would also be possible with techniques like context clustering and classification.

Pre-processing issues.

By searching for phrases occurring in different parts of the same document, it is possible to verify that the entire document made it through the pre-processing pipeline – which is useful for improving it. For example, we found a news article in OSCAR, the initial paragraphs of which are missing from ROOTS.

6 Limitations and Future Work

A major limitation of this work is that to mitigate possible issues on the data governance side, we can only provide short snippets of the indexed texts, as is typical of web search engines. We strive to provide links to the original text sources, but this metadata is not consistently available in ROOTS.

Implementation-wise, the current version of exact search is exact down to capitalization and punctuation, and fuzzy search can be noticeably slower. These issues will be addressed in future versions.

The current tool is heavily influenced by the UX of search engines, and its core functionality is similar. In future we intend to review classic corpus analysis tools for ideas of different presentation modes, such as concordance and word sketches. We would like to add more quantitative information, e.g. term frequency information, number of hits, and co-occurrence statistics. Community feedback and suggestions are welcome in the Community tab of the demo. We are also pursuing a spin-off collaboration with Pyserini to make large scale indexing and hosting of textual data even more seamless.

7 Acknowledgements

We thank the Pyserini team—Odunayo Ogundepo, Xinyu Zhang, Akintunde Oladipo and Jimmy Lin, for their indexing insights. Big thanks to the Gradio team, especially Pete Allen, Abubakar Abid and Freddy Boulton for their support on the frontend side, and to the Hugging Face infra team for answering questions regarding hosting the tool. We thank Carlos Muñoz Ferrandis and Meg Mitchell for valuable discussions.

8 Impact Statement

Our tool aims to improve the current state of documentation search for large corpora of web-scraped text, starting with the ROOTS corpus. However, it also comes with ethical considerations: for instance, it can also inadvertently display sensitive information such as PII and harmful content, and help malicious actors find information about a given topic from multiple sources (which is more difficult given only the raw text of the corpus). We are aware of these limitations, and have taken precautions to compensate for them, such as the PII redaction measures we present in Figure 2. We also present only a snippet of the raw text, which means that for accessing the full documents, users must sign up to be a part of the BigScience organization on the Hugging Face Hub, which also reduces the amount of information that potentially malicious anonymous users can access.

Bibliography

1. Abid et al. (2019) Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. 2019. Gradio: Hassle-free sharing and testing of ml models in the wild. arXiv preprint arXiv:1906.02569.
2. Abid et al. (2021) Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, pages 298–306, New York, NY, USA. Association for Computing Machinery.
3. Akiki et al. (2022) Christopher Akiki, Giada Pistilli, Margot Mieskes, Matthias Gallé, Thomas Wolf, Suzana Ilic, and Yacine Jernite. 2022. Bigscience: A case study in the social construction of a multilingual large language model. In Workshop on Broadening Research Collaborations 2022.
4. Baker (2004) Paul Baker. 2004. Querying Keywords: Questions of Difference, Frequency, and Sense in Keywords Analysis. Journal of English Linguistics, 32(4):346–359.
5. Bandy and Vincent (2021) Jack Bandy and Nicholas Vincent. 2021. Addressing "documentation debt" in machine learning research: A retrospective datasheet for bookcorpus.
6. Bañón et al. (2020) Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza. 2020. ParaCrawl: Web-Scale Acquisition of Parallel Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pag
7. Bender and Friedman (2018) Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6:587–604.
8. Bender et al. (2021) Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.