Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages
Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

TL;DR
This study evaluates the quality of four major web-crawled corpora across eleven European languages, revealing that higher intrinsic quality does not necessarily translate into better language model performance on downstream tasks.
Contribution
It provides a comprehensive comparison of the quality of web-crawled corpora and of their actual impact on language model training and performance across multiple languages.
Findings
MaCoCu and OSCAR have the highest intrinsic quality scores.
The CC100 corpus yields the best downstream task performance.
Differences in corpus quality have a limited effect on language model outcomes.
Abstract
Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we perform an intrinsic evaluation by performing a human evaluation of the quality of samples taken from different corpora; then, we assess the practical impact of the qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in quality of the corpora,…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Attention Is All You Need · Softmax · Residual Connection · Weight Decay · Linear Layer · Dense Connections · Adam · Dropout · Multi-Head Attention
