Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources
Xinyan Velocity Yu, Akari Asai, Trina Chatterjee, Junjie Hu, Eunsol Choi

TL;DR
This paper provides a detailed analysis of 156 multilingual NLP datasets, highlighting resource disparities across languages, and offers strategies for improving data collection in low-resource languages.
Contribution
It introduces a comprehensive annotation of how datasets are created and of their quality, and proposes practical strategies for enhancing multilingual data collection.
Findings
Resource disparities are significant across languages.
Crowdsourcing can effectively improve multilingual data quality.
Estimated researcher and crowd worker availability correlates with dataset presence.
Abstract
While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including input text and label sources and tools used to build them, and what they study, tasks they address and motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language,…
Taxonomy
Topics: Wikis in Education and Collaboration · Mobile Crowdsensing and Crowdsourcing · Topic Modeling
Methods: Attentive Walk-Aggregating Graph Neural Network
