TL;DR
This paper compares how visible datasets for low-resource languages are in centralized catalogues with how actively they circulate in the research literature. It finds a substantial gap between the two and emphasizes the importance of documentation and discoverability.
Contribution
It introduces the Resource Density Index (RDI) and combines catalogue data with literature mining to uncover dataset activity in low-resource languages.
Findings
Many widely spoken languages have little or no catalogue visibility, yet datasets for them are actively used in the research literature.
A substantial number of these datasets are openly accessible despite having few or no catalogue records.
The visibility gap affects multilingual NLP data availability and documentation.
Abstract
Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million…
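The RDI definition in the abstract (catalogued datasets per one million speakers, averaged across catalogues) can be illustrated with a minimal Python sketch. The function names `rdi` and `mean_rdi`, the catalogue keys, and all counts and speaker figures below are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's exact averaging procedure may differ.

```python
def rdi(dataset_count: int, speakers: int) -> float:
    """Resource Density Index: catalogued datasets per one million speakers."""
    if speakers <= 0:
        raise ValueError("speakers must be a positive number")
    return dataset_count / (speakers / 1_000_000)


def mean_rdi(counts_by_catalogue: dict[str, int], speakers: int) -> float:
    """Average the RDI over several catalogues (assumed here: a plain mean)."""
    if not counts_by_catalogue:
        raise ValueError("at least one catalogue count is required")
    values = [rdi(count, speakers) for count in counts_by_catalogue.values()]
    return sum(values) / len(values)


# Hypothetical language with 20 million speakers and made-up catalogue counts.
example_counts = {"LRE Map": 3, "LDC": 1}
example_rdi = mean_rdi(example_counts, speakers=20_000_000)
print(f"mean RDI: {example_rdi:.2f}")
```

Under this definition, an RDI of 0.1 corresponds to one catalogued dataset per ten million speakers, which matches the threshold mentioned in the abstract.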
