Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios

TL;DR
This paper audits 205 language-specific corpora from five major web-crawled multilingual datasets, revealing quality issues such as unusable text, mislabeled languages, and low sentence quality, and recommends techniques for evaluating and improving such corpora.
Contribution
It manually audits corpora from CCAligned, ParaCrawl, WikiMatrix, OSCAR, and mC4, supplements the human audit with automatic analyses, and proposes techniques for evaluating and improving multilingual corpora.
Findings
A significant fraction of corpora, especially lower-resource ones, contain fewer than 50% sentences of acceptable quality.
At least 15 corpora have no usable text.
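Many corpora are mislabeled or use nonstandard or ambiguous language codes.
These issues are easy to detect even for non-proficient speakers of the language.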
Abstract
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data…
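The corpus-level figures quoted in the abstract (corpora with no usable text, corpora with less than 50% acceptable sentences) can be derived from per-sentence audit labels. The following is a minimal sketch of that bookkeeping, not the authors' tooling: the label names (`acceptable`, `wrong_language`, `non_linguistic`), the function names, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical label for a sentence judged usable by an auditor.
# The paper's own annotation scheme may use different labels.
ACCEPTABLE = "acceptable"


def audit_summary(labels):
    """Summarise one corpus from a list of per-sentence audit labels."""
    counts = Counter(labels)
    total = len(labels)
    share = counts[ACCEPTABLE] / total if total else 0.0
    return {
        "n_sampled": total,
        "acceptable_share": share,
        # Flag corpora whose audited sample has no acceptable sentence.
        "no_usable_text": total > 0 and counts[ACCEPTABLE] == 0,
        # Flag corpora where under half of the sample is acceptable.
        "below_half_acceptable": total > 0 and share < 0.5,
    }


def audit_corpora(samples):
    """Apply audit_summary to every corpus in a {name: labels} mapping."""
    return {name: audit_summary(labels) for name, labels in samples.items()}


# Made-up samples for two hypothetical corpora.
samples = {
    "corpus_a": ["acceptable", "acceptable", "wrong_language", "acceptable"],
    "corpus_b": ["non_linguistic", "wrong_language", "non_linguistic"],
}
report = audit_corpora(samples)
for name, summary in report.items():
    print(name, summary)
```

A sample-based estimate like this only describes the sentences that were actually inspected, so the sample size is reported alongside each share.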
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Authorship Attribution and Profiling
Methods: OSCAR
