A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets
Tanja Samardzic, Ximena Gutierrez, Christian Bentz, Steven Moran, Olga Pelloni

TL;DR
This paper introduces a measure for evaluating the linguistic diversity of multilingual NLP datasets. It compares a dataset against a reference language sample using structural language features, and adds an automatic text-based measure to overcome data sparsity in manually collected features, enabling better dataset comparison and selection.
Contribution
It proposes a novel, interpretable diversity score based on linguistic features and a set comparison method, enhancing the assessment of multilingual datasets beyond simple language counts.
Findings
Identifies gaps in existing datasets, such as missing synthetic languages.
Provides a ranking of popular multilingual datasets based on linguistic diversity.
Demonstrates the effectiveness of the proposed measure across multiple datasets.
Abstract
Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP. Linguistic diversity of these data sets is typically measured as the number of languages or language families included in the sample, but such measures do not consider structural properties of the included languages. In this paper, we propose assessing linguistic diversity of a data set against a reference language sample as a means of maximising linguistic diversity in the long run. We represent languages as sets of features and apply a version of the Jaccard index suitable for comparing sets of measures. In addition to the features extracted from typological data bases, we propose an automatic text-based measure, which can be used as a means of overcoming the well-known problem of data sparsity in manually collected features. Our diversity score is interpretable in terms of…
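Example (illustrative sketch, not the authors' code)
The abstract describes representing languages as sets of features and comparing a data set to a reference sample with a version of the Jaccard index suited to sets of measures. The snippet below is a minimal sketch of one common generalization of that index (sum of minima over sum of maxima) applied to binned feature counts. The abstract does not spell out the exact variant used, so this choice is an assumption, and the function names, bin width, and feature values are all hypothetical.

```python
from collections import Counter


def binned_counts(values, bin_width):
    """Count how many languages fall into each bin of a numeric feature."""
    return Counter(int(v // bin_width) for v in values)


def weighted_jaccard(a, b):
    """Generalized Jaccard index: sum of per-bin minima over sum of per-bin maxima.

    Assumption: this min/max form stands in for the paper's "version of the
    Jaccard index"; the abstract does not give the exact definition.
    """
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Two empty samples are treated as identical.
    return numerator / denominator if denominator else 1.0


# Hypothetical per-language values of one structural feature.
dataset_values = [1.2, 1.4, 2.6]
reference_values = [1.1, 2.3, 3.7]

dataset_bins = binned_counts(dataset_values, bin_width=1.0)
reference_bins = binned_counts(reference_values, bin_width=1.0)
score = weighted_jaccard(dataset_bins, reference_bins)
print(score)
```

Under this sketch, a score near 1 would mean the data set covers the reference sample's feature distribution closely, and bins that are populated in the reference but empty in the data set point to language types the data set is missing.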
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices
Methods: Sparse Evolutionary Training · mBERT
