Dealing with the Hard Facts of Low-Resource African NLP

Yacouba Diarra; Nouhoum Souleymane Coulibaly; Panga Azazia Kamaté; Madani Amadou Tall; Emmanuel Élisé Koné; Aymane Dembélé; Michael Leventhal

arXiv:2511.18557·cs.CL·November 25, 2025

TL;DR

This paper documents the field collection and semi-automated annotation of 612 hours of spontaneous Bambara speech, builds several ultra-compact and small monolingual models from that data, and presents evidence for the importance of human evaluation. The dataset, evaluation sets, models, and code are publicly available.

Contribution

The paper contributes a 612-hour Bambara speech dataset, a semi-automated annotation method, and several ultra-compact and small monolingual models, along with practical suggestions for data collection, annotation, and model design, and publicly available evaluation datasets.

Findings

01 Large Bambara speech dataset created and annotated

02 Small monolingual models developed and evaluated

03 Human evaluation proves crucial for model assessment

Abstract

Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.

Taxonomy

Topics: ICT in Developing Communities · Language and cultural evolution · Natural Language Processing Techniques