Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
Surangika Ranathunga, Nisansa de Silva

TL;DR
This paper investigates the multifaceted linguistic disparities in NLP, revealing that data availability, language family, geography, and socio-economic factors contribute to unequal resource distribution across languages.
Contribution
It offers a comprehensive analysis of linguistic disparity in NLP, challenging simple data-based classifications and examining multiple factors influencing resource allocation.
Findings
Many languages lack coverage in NLP resources and platforms.
Disparities exist even within the same language groups.
Factors like language family, geography, and GDP impact resource distribution.
Abstract
Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, different facets of this problem, or the reasons behind this disparity are seldom discussed within the NLP community. This paper provides a comprehensive analysis of the disparity that exists within the languages of the world. We show that simply categorising languages considering data availability may not be always correct. Using an existing language categorisation based on speaker population and vitality, we analyse the distribution of language data resources, amount of NLP/CL research, inclusion in multilingual web-based platforms and the inclusion in pre-trained multilingual models. We show that many languages do not get covered in these resources or platforms, and even within the languages belonging to the same language group, there is wide disparity. We analyse the impact of…
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Topic Modeling
