Robust clustering of languages across Wikipedia growth
Kristina Ban, Matjaz Perc, Zoran Levnajic

TL;DR
This study analyzes the growth patterns of 26 Wikipedia language editions over 15 years, revealing six robust clusters that are independent of language families and influenced by cultural and informational factors.
Contribution
The paper introduces a robust clustering approach to identify growth patterns across Wikipedia languages, independent of traditional language family classifications.
Findings
Six well-defined clusters of Wikipedias with shared growth patterns
Clusters are consistent across four different clustering methods
Growth factors are linked to cultural and informational similarities, not language families
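One way to make "consistent across four different clustering methods" concrete is to score how often two cluster assignments agree on whether a pair of Wikipedias belongs together. The sketch below uses the Rand index for that purpose. This is an assumption for illustration: the summary does not say which agreement measure or which four methods the authors used, and the label lists are made-up toy data, not results from the paper.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair counts as agreement when both partitions put the two items
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical assignments of six items under three hypothetical methods.
method_1 = [0, 0, 1, 1, 2, 2]
method_2 = [2, 2, 0, 0, 1, 1]  # same grouping, cluster ids renamed
method_3 = [0, 0, 1, 2, 2, 2]  # one item moved to another cluster

score_same = rand_index(method_1, method_2)
score_diff = rand_index(method_1, method_3)
print(score_same, score_diff)
```

The index ignores how clusters are numbered, so two methods that find the same groups score 1.0 even when their labels differ. A chance-corrected variant such as the adjusted Rand index would be a stricter choice for a real comparison.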
Abstract
Wikipedia is the largest existing knowledge repository, and it grows through genuine crowdsourcing. While the English Wikipedia, with over five million articles, is the most extensive and the most researched, comparatively little is known about the behavior and growth of the remaining 283 smaller Wikipedias, the smallest of which, Afar, has only one article. Here we use a subset of this data, consisting of 14962 different articles, each of which exists in 26 different languages, from Arabic to Ukrainian. We study the growth of Wikipedias in these languages over a time span of 15 years. We show that, while an average article follows a random path from one language to another, there exist six well-defined clusters of Wikipedias that share common growth patterns. The make-up of these clusters is remarkably robust against the method used for their determination, as we verify via…
