Dealing with the Hard Facts of Low-Resource African NLP
Yacouba Diarra, Nouhoum Souleymane Coulibaly, Panga Azazia Kamaté, Madani Amadou Tall, Emmanuel Élisé Koné, Aymane Dembélé, Michael Leventhal

TL;DR
This paper documents the field collection and semi-automated annotation of 612 hours of spontaneous Bambara speech, trains several monolingual ultra-compact and small models on that dataset, and presents evidence for the importance of human evaluation. The dataset, evaluation sets, models, and code are publicly released.
Contribution
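The paper contributes a 612-hour Bambara spontaneous speech dataset, a semi-automated transcription workflow, and several monolingual ultra-compact and small models, along with practical suggestions for data collection, annotation, and model design and publicly available evaluation datasets, models, and code.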
Findings
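612 hours of spontaneous Bambara speech collected in the field and annotated semi-automatically with transcriptions
Several monolingual ultra-compact and small models built and evaluated, both automatically and by humans
Evidence presented that human evaluation is important when assessing model output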
Abstract
Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
Taxonomy
Topics: ICT in Developing Communities · Language and cultural evolution · Natural Language Processing Techniques
