Dataset Geography: Mapping Language Data to Language Users
Fahim Faisal, Yinkai Wang, Antonios Anastasopoulos

TL;DR
This paper investigates the geographical distribution of NLP datasets to assess how well they represent language speakers worldwide, highlighting disparities and suggesting improvements for more inclusive language technology development.
Contribution
It introduces a methodology to analyze the geographical representativeness of NLP datasets and examines cross-lingual consistency and economic factors influencing data distribution.
Findings
Identifies geographical disparities in NLP dataset coverage
Provides insights into cross-lingual consistency of entity recognition systems
Suggests factors influencing dataset distribution such as economic variables
Abstract
As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether, and by how much, NLP datasets match the expected needs of the language speakers. In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency and giving suggestions for more robust evaluation. Last, we explore some geographical and economic factors that may explain the observed dataset distributions. Code and data are available here: https://github.com/ffaisal93/dataset_geography. Additional visualizations are available here:…
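The idea in the abstract of comparing where a dataset's entities are located with where a language's speakers live can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the entity identifiers, the `entity_to_country` lookup, the speaker shares, and the choice of total variation distance as the gap measure are all hypothetical.

```python
from collections import Counter


def country_shares(linked_entities, entity_to_country):
    """Fraction of linked entity mentions that fall in each country.

    Entities with no known country are skipped.
    """
    counts = Counter(
        entity_to_country[e] for e in linked_entities if e in entity_to_country
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts.

    0.0 means identical distributions, 1.0 means disjoint support.
    This measure is an illustrative choice, not the paper's metric.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical output of an entity linker run over a Swahili dataset.
entity_to_country = {
    "Q_nairobi": "KE",
    "Q_dar_es_salaam": "TZ",
    "Q_london": "GB",
}
linked = ["Q_london", "Q_london", "Q_nairobi", "Q_dar_es_salaam"]

# Made-up speaker shares per country, for illustration only.
speaker_share = {"KE": 0.3, "TZ": 0.7}

dataset_share = country_shares(linked, entity_to_country)
gap = total_variation(dataset_share, speaker_share)
print(dataset_share)
print(f"representativeness gap: {gap:.2f}")
```

A larger gap in this sketch would indicate that the entities mentioned in the dataset are concentrated in countries other than those where the speakers are assumed to live.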
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
