Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review
Josh McGiff, Nikola S. Nikolov

TL;DR
This systematic review examines strategies to mitigate data scarcity in generative language models for low-resource languages, highlighting current approaches, challenges, and future directions for inclusive NLP development.
Contribution
The review provides a comprehensive categorisation and evaluation of existing methods for addressing data scarcity in low-resource language modelling, offering insights for expanding equitable NLP tools.
Findings
Transformer-based models dominate current approaches
Limited diversity in low-resource languages studied
Evaluation methods lack consistency across studies
Abstract
Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRLs). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · ICT in Developing Communities · Topic Modeling
