An Open Multilingual System for Scoring Readability of Wikipedia
Mykola Trokhymovych, Indira Sen, Martin Gerlach

TL;DR
This paper introduces a multilingual model for assessing Wikipedia article readability across 14 languages, addressing the lack of such tools beyond English and demonstrating strong zero-shot performance.
Contribution
The authors develop the first multilingual readability assessment model for Wikipedia, trained on a novel dataset spanning 14 languages, enabling scalable evaluation without language-specific ground truth.
Findings
Achieves over 80% ranking accuracy in a zero-shot setting across 14 languages.
Outperforms previous benchmarks in multilingual readability assessment.
Provides the first comprehensive overview of Wikipedia readability beyond English.
Abstract
With over 60M articles, Wikipedia has become the largest platform for open and freely accessible knowledge. While it has more than 15B monthly visits, its content is believed to be inaccessible to many readers due to the lack of readability of its text. However, previous investigations of the readability of Wikipedia have been restricted to English only, and there are currently no systems supporting the automatic readability assessment of the 300+ languages in Wikipedia. To bridge this gap, we develop a multilingual model to score the readability of Wikipedia articles. To train and evaluate this model, we create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online children's encyclopedias. We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages and…
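The ranking accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes, since the summary does not spell out the definition, that each evaluation item is a matched pair (an original Wikipedia article and its simplified counterpart) and that a pair counts as correct when the model gives the original the higher difficulty score. The function name ranking_accuracy, the score convention (higher means harder to read), and the example scores are all hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple


def ranking_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the original article is scored as harder.

    Each pair is (score_original, score_simplified). Assumption: a higher
    score means the text is harder to read. Ties count as incorrect.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for original, simplified in pairs if original > simplified)
    return correct / len(pairs)


# Made-up scores for five matched article pairs (illustrative only).
example_pairs = [(0.9, 0.2), (0.4, 0.6), (1.3, -0.5), (0.1, 0.0), (0.7, 0.3)]
accuracy = ranking_accuracy(example_pairs)
print(f"ranking accuracy: {accuracy:.2f}")
```

Because a metric of this kind only compares the two scores within a pair, it needs no absolute readability labels, which may be one reason a pairwise setup suits languages without graded ground truth.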
Taxonomy
Topics: Text Readability and Simplification · Wikis in Education and Collaboration · Natural Language Processing Techniques
