GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran, François Yvon, Hinrich Schütze

TL;DR
GlotCC is a large, open, cleaned corpus derived from CommonCrawl that covers more than 1000 languages, with a focus on minority languages. It is built with a reproducible open-source pipeline to support multilingual research.
Contribution
This work introduces GlotCC, a comprehensive, clean, and openly available corpus for minority languages, along with the pipeline and models used for its creation.
Findings
Coverage of more than 1000 languages, with a focus on minority languages
Open-source pipeline, language identification model, and filters provided
2TB of clean, document-level, general-domain text
Abstract
The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community. Corpus v. 1.0 https://huggingface.co/datasets/cis-lmu/GlotCC-v1, Pipeline v. 3.0 https://github.com/cisnlp/GlotCC.
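The abstract describes a pipeline that combines a language identification model with noise filters. As a rough illustration of that general shape only, here is a minimal sketch in Python. It is not the GlotCC pipeline: the `Document` type, the `toy_lid` word-list classifier, the `looks_noisy` heuristic, and every threshold are hypothetical placeholders chosen for this example. A real system would use a trained language identification model in place of `toy_lid`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class Document:
    url: str
    text: str


# Hypothetical toy lexicon; it stands in for a real language identification model.
TOY_LEXICON = {
    "eng": {"the", "and", "is", "of"},
    "deu": {"der", "und", "ist", "das"},
}


def toy_lid(text: str) -> Tuple[str, float]:
    """Return (label, confidence) based on the share of words found in a tiny word list."""
    words = text.lower().split()
    if not words:
        return "und", 0.0  # "und" = undetermined
    best_label, best_score = "und", 0.0
    for label, vocab in TOY_LEXICON.items():
        score = sum(w in vocab for w in words) / len(words)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


def looks_noisy(text: str, min_words: int = 5, max_nonalpha_ratio: float = 0.3) -> bool:
    """Illustrative noise heuristic: too few words, or too many non-letter characters."""
    if len(text.split()) < min_words:
        return True
    non_space = [c for c in text if not c.isspace()]
    nonalpha = sum(not c.isalpha() for c in non_space)
    return nonalpha / len(non_space) > max_nonalpha_ratio


def filter_corpus(
    docs: Iterable[Document],
    lid: Callable[[str], Tuple[str, float]] = toy_lid,
    min_confidence: float = 0.3,
) -> Iterator[Tuple[str, Document]]:
    """Yield (language label, document) for documents that pass both checks."""
    for doc in docs:
        if looks_noisy(doc.text):
            continue  # drop documents flagged by the noise heuristic
        label, confidence = lid(doc.text)
        if confidence >= min_confidence:
            yield label, doc  # keep only confidently labeled documents
```

Working at the document level, as in this sketch, means each kept item is a whole page with a single language label. Passing the identifier in as the `lid` argument keeps the filtering logic independent of whichever model is used.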
