Contextualising Levels of Language Resourcedness that affect NLP tasks
C. Maria Keet, Langa Khumalo

TL;DR
This paper proposes a nuanced framework for classifying languages based on their resource availability for NLP, moving beyond the simple high/low resource dichotomy to improve research planning and resource allocation.
Contribution
It introduces a new matrix-based typology that characterizes languages along a resource spectrum based on contextual societal features, especially focusing on African languages.
Findings
Develops a matrix for language resource classification
Provides contextual features for each resource level
Enhances understanding of language resource distribution
Abstract
Several widely used software applications involve some form of processing of natural language, with tasks ranging from digitising hardcopies and text processing to speech generation. Varied language resources are used to develop software systems to accomplish a wide range of natural language processing (NLP) tasks, such as the ubiquitous spellcheckers and chatbots. Languages are typically characterised as either low-resourced (LRL) or high-resourced languages (HRL), with African languages having been characterised as resource-scarce languages and English as by far the most well-resourced language. But what lies in between? We argue that the dichotomous typology of LRL and HRL for all languages is problematic. Through a clear understanding of language resources situated in a society, a matrix is developed that characterises languages as Very LRL, LRL, RL, HRL and Very HRL. The characterisation is…
Taxonomy
Topics: Second Language Learning and Teaching
Methods: Focus
