Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions

Marc Miquel-Ribé; David Laniado

arXiv:1901.07999·cs.CY·June 11, 2019

TL;DR

This paper introduces the Wikipedia Cultural Diversity dataset, classifying articles across 300 language editions to analyze cultural representation and support cross-cultural research in digital humanities.

Contribution

The paper releases a dataset that identifies, for each Wikipedia language edition, the articles belonging to its cultural context, together with the classification methodology and the features used to feed the classifier.

Findings

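01. The dataset covers 300 Wikipedia language editions.

02. The paper describes the methodology used to classify articles as part of a language's cultural context.

03. Envisioned applications include detecting, measuring and countering content gaps across editions.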

Abstract

In this paper we present the Wikipedia Cultural Diversity dataset. For each existing Wikipedia language edition, the dataset contains a classification of the articles that represent its associated cultural context, i.e. all concepts and entities related to the language and to the territories where it is spoken. We describe the methodology we employed to classify articles, and the rich set of features that we defined to feed the classifier, and that are released as part of the dataset. We present several purposes for which we envision the use of this dataset, including detecting, measuring and countering content gaps in the Wikipedia project, and encouraging cross-cultural research in the field of digital humanities.
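As a rough illustration of the content-gap use case mentioned in the abstract, the sketch below computes how much of one edition's cultural-context article set is also present in another edition. It is not the authors' method or the dataset's actual schema: the function name `coverage`, the variable names, and the Wikidata-style identifiers are all hypothetical, and the inputs are toy data.

```python
def coverage(ccc_source: set[str], target_articles: set[str]) -> float:
    """Share of the source edition's cultural-context items that also
    exist as articles in the target edition (0.0 if the source set is empty)."""
    if not ccc_source:
        return 0.0
    return len(ccc_source & target_articles) / len(ccc_source)


# Toy data (assumption): items are keyed by language-independent identifiers.
ccc_source = {"Q1", "Q2", "Q3", "Q4"}      # cultural-context items of edition A
target_articles = {"Q1", "Q3", "Q9"}       # items that have an article in edition B

share = coverage(ccc_source, target_articles)
gap = 1.0 - share                           # fraction of A's context missing from B
print(f"coverage={share:.2f}, gap={gap:.2f}")
```

Keying articles by a shared identifier rather than by title is what makes the set intersection meaningful across languages; any real analysis would need to take the identifiers and classification labels from the released dataset itself.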
