Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Julia Kreutzer; Isaac Caswell; Lisa Wang; Ahsan Wahab; Daan van Esch; Nasanbayar Ulzii-Orshikh; Allahsera Tapo; Nishant Subramani; Artem Sokolov; Claytone Sikasote; Monang Setyawan; Supheakmungkol Sarin; Sokhar Samb; Benoît Sagot; Clara Rivera; Annette Rios; Isabel Papadimitriou; Salomey Osei; Pedro Ortiz Suarez; Iroro Orife; Kelechi Ogueji; Andre Niyongabo Rubungo; Toan Q. Nguyen; Mathias Müller; André Müller; Shamsuddeen Hassan Muhammad; Nanda Muhammad; Ayanda Mnyakeni; Jamshidbek Mirzakhalov; Tapiwanashe Matangira; Colin Leong; Nze Lawson; Sneha Kudugunta; Yacine Jernite; Mathias Jenny; Orhan Firat; Bonaventure F. P. Dossou; Sakhile Dlamini; Nisansa de Silva; Sakine Çabuk Ballı; Stella Biderman; Alessia Battisti; Ahmed Baruwa; Ankur Bapna; Pallavi Baljekar; Israel Abebe Azime; Ayodele Awokoya; Duygu Ataman; Orevaoghene Ahia; Oghenefego Ahia; Sweta Agrawal; Mofetoluwa Adeyemi

arXiv:2103.12028·cs.CL·February 22, 2022·AfricaNLP

TL;DR

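This paper manually audits 205 language-specific corpora drawn from five major web-crawled multilingual datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). It finds unusable text, mislabeled languages, and low sentence quality, especially in lower-resource corpora, and offers recommendations for evaluating and improving such data.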

Contribution

The paper provides a manual audit of five major multilingual datasets, supplemented by automatic analyses. It documents the prevalent quality issues and recommends techniques for evaluating and improving multilingual corpora.

Findings

01

A significant fraction of the audited corpora contain less than 50% sentences of acceptable quality.

02

At least 15 corpora have no usable text.

03

These issues are easy to detect even for non-proficient speakers of the languages.

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data…
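As a rough illustration of the "less than 50% sentences of acceptable quality" criterion, the sketch below computes the share of acceptable sentences per corpus from a sample of manual labels and flags corpora that fall under a threshold. The label names, the corpus names, and the toy annotations are all hypothetical; the abstract does not specify the annotation scheme, so this should be read as one possible way to tabulate such an audit, not as the authors' procedure.

```python
from collections import Counter

# Hypothetical label name; the abstract does not define the annotation scheme.
ACCEPTABLE = "acceptable"


def acceptable_fraction(labels):
    """Return the share of sampled sentences labeled acceptable (0.0 if empty)."""
    if not labels:
        return 0.0
    return Counter(labels)[ACCEPTABLE] / len(labels)


def flag_low_quality(audit, threshold=0.5):
    """Return the sorted names of corpora whose acceptable share is below threshold."""
    return sorted(
        name
        for name, labels in audit.items()
        if acceptable_fraction(labels) < threshold
    )


# Made-up annotations for illustration only; not data from the paper.
audit = {
    "corpus_a": ["acceptable", "acceptable", "wrong_language", "acceptable"],
    "corpus_b": ["non_linguistic", "wrong_language", "acceptable", "non_linguistic"],
    "corpus_c": ["non_linguistic", "non_linguistic"],
}

print(flag_low_quality(audit))
```

In this toy input, `corpus_c` has no acceptable sentences at all, which loosely mirrors the "no usable text" case described in the abstract.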

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Authorship Attribution and Profiling

Methods: OSCAR