A practical approach to language complexity: a Wikipedia case study
Taha Yasseri, András Kornai, and János Kertész

TL;DR
This study empirically compares language complexity in Simple and Main English Wikipedia, revealing that simpler articles mainly use shorter sentences rather than simpler vocabulary or syntax, with complexity varying by topic and controversy.
Contribution
It provides a detailed empirical analysis of language complexity differences between Simple and Main Wikipedia, highlighting sentence length as a key factor and exploring topical and conflict-related variations.
Findings
Simple Wikipedia uses shorter sentences than Main Wikipedia.
Vocabulary richness is similar in both Wikipedia versions.
Controversial articles tend to have less complex language.
Abstract
In this paper we present a statistical analysis of English texts from Wikipedia. We try to address the issue of language complexity empirically by comparing the simple English Wikipedia (Simple) to comparable samples of the main English Wikipedia (Main). Simple is supposed to use a more simplified language with a limited vocabulary, and editors are explicitly requested to follow this guideline, yet in practice the vocabulary richness of both samples is at the same level. Detailed analysis of longer units (n-grams of words and part of speech tags) shows that the language of Simple is less complex than that of Main primarily due to the use of shorter sentences, as opposed to drastically simplified syntax or vocabulary. Comparing the two language varieties by the Gunning readability index supports this conclusion. We also report on the topical dependence of language complexity, e.g. that…
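The abstract uses the Gunning readability index as a cross-check. As a rough illustration of why shorter sentences alone lower that index, here is a minimal sketch of the standard Gunning fog formula, 0.4 × (average sentence length + percentage of words with three or more syllables). The function names `count_syllables` and `gunning_fog`, the regex-based sentence splitting, and the vowel-group syllable heuristic are assumptions made for this example, not the authors' implementation; the sketch also skips the usual exclusions for proper nouns, compound words, and common suffixes.

```python
import re


def count_syllables(word):
    # Crude heuristic (an assumption of this sketch): one syllable per
    # run of consecutive vowels, with a minimum of one per word.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text):
    # Split on sentence-final punctuation; real tokenizers handle
    # abbreviations and decimals more carefully than this.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    # "Complex" words: three or more syllables under the heuristic above.
    complex_words = [w for w in words if count_syllables(w) >= 3]

    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100.0 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + pct_complex)


short_version = "The cat sat. The dog ran. The sun set."
long_version = "The cat sat and the dog ran and the sun set."
print(gunning_fog(short_version), gunning_fog(long_version))
```

Both sample texts use only one-syllable words, so any difference in score comes from sentence length alone, which mirrors the kind of effect the abstract attributes to Simple.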
