Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

Zhiyin Tan; Changxu Duan

arXiv:2605.17442·cs.CL·May 19, 2026

TL;DR

This paper compares how many datasets centralized catalogues record for a language with the datasets that actually circulate in the research literature. For low-resource languages the two diverge substantially, which underlines the importance of documentation and discoverability.

Contribution

It introduces the Resource Density Index (RDI), the number of catalogued datasets per one million speakers, and combines a catalogue-based baseline with literature mining to uncover dataset activity in low-resource languages.

Findings

01

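Many widely spoken languages have low catalogue visibility but active datasets in the research literature: 118 of the 200 most widely spoken languages (59%) have an average RDI of zero across the LRE Map and LDC.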

02

A substantial number of datasets are openly accessible even though few are recorded in catalogues.

03

The visibility gap affects multilingual NLP data availability and documentation.

Abstract

Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million…
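The RDI definition in the abstract can be illustrated with a minimal sketch. It assumes that the "average RDI" is the simple arithmetic mean of the per-catalogue values, which the abstract does not spell out. The function names and the example counts are hypothetical and are not taken from the paper or its repository.

```python
def rdi(dataset_count: int, speakers: int) -> float:
    """Catalogued datasets per one million speakers (definition from the abstract)."""
    if speakers <= 0:
        raise ValueError("speakers must be positive")
    return dataset_count / (speakers / 1_000_000)


def average_rdi(counts_by_catalogue: dict[str, int], speakers: int) -> float:
    """Mean RDI over catalogues.

    Assumption: a plain arithmetic mean of the per-catalogue RDI values.
    """
    if not counts_by_catalogue:
        raise ValueError("at least one catalogue is required")
    values = [rdi(count, speakers) for count in counts_by_catalogue.values()]
    return sum(values) / len(values)


# Hypothetical language with 50 million speakers,
# 3 datasets in one catalogue and 1 in the other.
example = average_rdi({"LRE Map": 3, "LDC": 1}, 50_000_000)

# The 0.1 threshold from the abstract: one dataset per ten million speakers.
threshold_case = rdi(1, 10_000_000)
```

Under these assumptions, a language with no records in either catalogue has an average RDI of zero regardless of its speaker population, which is the situation the abstract reports for 118 of the 200 languages.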

Code & Models

Repositories

zhiyintan/dataset-visibility-asymmetry (GitHub)
