GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Amir Hossein Kargaran; François Yvon; Hinrich Schütze

arXiv:2410.23825·cs.CL·March 5, 2025

TL;DR

GlotCC is a large, open, noise-cleaned corpus derived from CommonCrawl that covers more than 1000 languages, with a focus on minority languages, and is built with a reproducible open-source pipeline to support multilingual research.

Contribution

This work introduces GlotCC, a clean, openly available, document-level corpus with broad coverage of minority languages, and releases the pipeline, language identification model, and filters used to create it.

Findings

01

Coverage of more than 1000 languages, with a focus on minority languages

02

Open-source pipeline and tools provided

03

2TB of general-domain text at the document level

Abstract

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned of noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general-domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it (including the pipeline, language identification model, and filters) available to the research community. Corpus v. 1.0: https://huggingface.co/datasets/cis-lmu/GlotCC-v1. Pipeline v. 3.0: https://github.com/cisnlp/GlotCC.
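The abstract names three released components: a pipeline, a language identification model, and filters. As a rough illustration of how such components can fit together at the document level, here is a minimal sketch. It is not the authors' pipeline: the `Document` type, the `clean_corpus` function, the thresholds, the `alpha_ratio` heuristic, and the toy identifier with its labels are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical interface: a language identifier maps text to (label, confidence).
LangId = Callable[[str], Tuple[str, float]]


@dataclass
class Document:
    url: str
    text: str


def alpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters (illustrative noise signal)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def clean_corpus(
    docs: Iterable[Document],
    identify: LangId,
    min_confidence: float = 0.5,  # assumed threshold, not taken from the paper
    min_chars: int = 20,          # assumed threshold, not taken from the paper
    min_alpha: float = 0.6,       # assumed threshold, not taken from the paper
) -> Iterator[Tuple[str, Document]]:
    """Yield (language_label, document) for documents that pass every check."""
    for doc in docs:
        text = doc.text.strip()
        if len(text) < min_chars:
            continue  # too short to identify reliably
        if alpha_ratio(text) < min_alpha:
            continue  # mostly digits or symbols, likely noise
        label, confidence = identify(text)
        if confidence < min_confidence:
            continue  # the identifier is unsure, so drop the document
        yield label, doc


def toy_identify(text: str) -> Tuple[str, float]:
    """Stand-in for a real language identification model; labels are illustrative."""
    if "ü" in text or "ß" in text:
        return ("deu_Latn", 0.9)
    return ("und_Zyyy", 0.1)


docs = [
    Document("a", "Die Straße ist über die Brücke erreichbar."),
    Document("b", "12345 67890 !!! ### $$$ 0000 1111"),
    Document("c", "short"),
    Document("d", "This sentence has no marker characters at all."),
]
kept = [(label, doc.url) for label, doc in clean_corpus(docs, toy_identify)]
```

In this sketch the cheap checks run before language identification so that obviously noisy documents never reach the model. A real pipeline would replace `toy_identify` with a trained identifier and use filters tuned on actual crawl data.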

Code & Models

Videos

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages (SlidesLive)