Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

Rik van Noord; Taja Kuzman; Peter Rupnik; Nikola Ljubešić; Miquel Esplà-Gomis; Gema Ramírez-Sánchez; Antonio Toral

arXiv:2403.08693 · cs.CL · March 14, 2024 · 1 citation

Open Access

TL;DR

This study evaluates the quality of four major web-crawled corpora across eleven European languages, revealing that higher intrinsic quality does not necessarily translate into better language model performance on downstream tasks.

Contribution

The paper provides a comprehensive comparison of the quality of web-crawled corpora and of their actual impact on language model training and performance across multiple languages.

Findings

01

MaCoCu and OSCAR have the highest intrinsic quality scores.

02

The CC100 corpus yields the best downstream task performance.

03

Corpus quality differences have limited effect on language model outcomes.

Abstract

Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we perform an intrinsic evaluation by performing a human evaluation of the quality of samples taken from different corpora; then, we assess the practical impact of the qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in quality of the corpora,…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Attention Is All You Need · Softmax · Residual Connection · Weight Decay · Linear Layer · Dense Connections · Adam · Dropout · Multi-Head Attention