Does Corpus Quality Really Matter for Low-Resource Languages?

Mikel Artetxe; Itziar Aldabe; Rodrigo Agerri; Olatz Perez-de-Viñaspre; Aitor Soroa

arXiv:2203.08111·cs.CL·October 27, 2022

5 Models, 1 Dataset

TL;DR

This study asks whether high-quality corpora are essential for low-resource language NLP. Taking Basque as a case study, it finds that models pre-trained on lower-quality CommonCrawl-derived corpora reach downstream NLU performance similar to models pre-trained on a manually curated corpus.

Contribution

It introduces EusCrawl, a high-quality Basque corpus built by manually identifying and scraping websites, and suggests that corpus quality may be less critical than size and coverage for NLU tasks.

Findings

01

Native annotators rate 66% of EusCrawl documents as high-quality, versus <33% for both mC4 and CC100.

02

Downstream NLU performance is similar across corpora despite quality differences.

03

Corpus size and domain coverage may outweigh quality in low-resource language NLP.

Abstract

The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream NLU tasks regardless of the corpus used for pre-training. Our work…
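The quality comparison in the abstract is a per-corpus share of documents rated high-quality. A minimal sketch of that calculation follows, assuming each document carries a single categorical rating. The function name, the label strings, and the toy ratings are all hypothetical and are not taken from the paper, whose annotation scheme is not described on this page.

```python
from collections import Counter


def high_quality_share(ratings, high_label="high"):
    """Return the fraction of ratings equal to high_label.

    `ratings` is a list of categorical labels, one per annotated document.
    The label vocabulary here is an assumption made for illustration.
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")
    return Counter(ratings)[high_label] / len(ratings)


# Toy, made-up labels for illustration only; NOT data from the paper.
toy_ratings = {
    "corpus_a": ["high", "high", "low", "high"],
    "corpus_b": ["low", "medium", "high", "low"],
}

shares = {name: high_quality_share(labels) for name, labels in toy_ratings.items()}

for name, share in shares.items():
    print(f"{name}: {share:.0%} rated high-quality")
```

With real annotations, the same function would be applied once per corpus to the list of labels assigned by the annotators.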

Code & Models

Models

Datasets

HiTZ/euscrawl
dataset · 89 dl
