How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

Kushal Tatariya; Artur Kulmizev; Wessel Poelman; Esther Ploeger; Marcel Bollmann; Johannes Bjerva; Jiaming Luo; Heather Lent; Miryam de Lhoneux

arXiv:2411.05527·cs.CL·May 5, 2026

TL;DR

This paper audits the quality of non-English Wikipedia by applying a filtering procedure normally used for noisy web text. It identifies systematic quality issues, proposes a 4-level quality ranking, and shows that quality filtering can improve NLP models trained on Wikipedia data.

Contribution

It introduces a systematic quality assessment and ranking of non-English Wikipedia, and evaluates the impact of data filtering on NLP model performance.

Findings

01

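Filtering reveals systematic quality issues, including script and language contamination, repeated template and placeholder articles, and a high concentration of bot-generated content.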

02

A 4-level quality ranking correlates with other quality measures.

03

Models trained on filtered data perform as well as or better than models trained on unfiltered data, especially for lower-quality editions.

Abstract

Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in NLP. However, in recent years, such assumptions of high quality have become the subject of scrutiny in low-resource and multilingual contexts. In this study, we subject the entirety of non-English Wikipedia to a data filtering procedure typically reserved for noisy web-text -- a process which removes a large percentage of the collection's data. In analysing the removed data, we reveal numerous systematic quality issues, such as script and language contamination, repeated template and placeholder articles, and a high concentration of bot-generated content. We consolidate these findings into a 4-level quality ranking of Wikipedia, which shows strong correspondence with alternative quality measures and heuristics. Lastly, we evaluate the downstream impact of quality filtering in…
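
To make the kinds of issues named in the abstract concrete, here is a minimal sketch of two heuristic checks: one for script contamination and one for repeated (template-like) articles. It is not the authors' pipeline. The function names `script_share` and `filter_articles`, the `min_share=0.8` threshold, and the use of Unicode character names as a proxy for script are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def script_share(text: str, script_prefix: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix.

    Using the Unicode character name (e.g. "CYRILLIC SMALL LETTER A") is a rough
    stand-in for real script detection; it is an assumption of this sketch.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


def filter_articles(articles, script_prefix, min_share=0.8):
    """Split (title, body) pairs into kept titles and (title, reason) removals.

    An article is removed if too few of its letters are in the expected script
    (hypothetical threshold min_share) or if its whitespace-normalized,
    lowercased body has already been seen.
    """
    kept, removed = [], []
    seen = Counter()
    for title, body in articles:
        normalized = " ".join(body.split()).lower()
        seen[normalized] += 1
        if script_share(body, script_prefix) < min_share:
            removed.append((title, "script"))
        elif seen[normalized] > 1:
            removed.append((title, "duplicate"))
        else:
            kept.append(title)
    return kept, removed


if __name__ == "__main__":
    sample = [
        ("A", "Это короткая статья о городе"),
        ("B", "This body is written in the wrong script"),
        ("C", "Это  короткая статья о городе"),
    ]
    kept, removed = filter_articles(sample, "CYRILLIC")
    print("kept:", kept)
    print("removed:", removed)
```

Inspecting the `removed` list, rather than discarding it, mirrors the idea in the abstract of analysing what a filter throws away. Detecting bot-generated content would need edit-history metadata and is not covered by this sketch.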
