The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

Chris Emezue; NaijaVoices Community; Busayo Awobade; Abraham Owodunni; Handel Emezue; Gloria Monica Tobechukwu Emezue; Nefertiti Nneoma Emezue; Sewade Ogun; Bunmi Akinremi; David Ifeoluwa Adelani; Chris Pal

arXiv:2505.20564 · cs.CL · July 15, 2025

TL;DR

The NaijaVoices dataset provides a large-scale, diverse speech corpus for Igbo, Hausa, and Yoruba. Finetuning speech recognition models on it substantially reduces word error rates, which helps address the under-representation of African languages in speech technology.

Contribution

We introduce the NaijaVoices dataset, a large, high-quality speech-text resource for African languages, with a novel data collection approach and demonstrated impact on speech recognition performance.

Findings

1. Achieved an average WER improvement of 75.86% when finetuning Whisper
2. Demonstrated the dataset's acoustic diversity and scalability
3. Enhanced multilingual speech processing for African languages

Abstract

The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages, including our focus languages Igbo, Hausa, and Yoruba, remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2,000+ African languages, limiting accessibility for roughly one billion people. Although datasets already exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These…
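The abstract reports WER improvements but does not define them on this page. The following is a minimal Python sketch, not the authors' evaluation code. It assumes that "improvement" means the relative reduction in word error rate against the baseline model before finetuning. It also assumes that transcripts are tokenized on whitespace with no text normalization. Both are assumptions. WER is computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The helper names `word_edit_distance`, `wer`, and `relative_wer_reduction` are hypothetical.

```python
from typing import Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over word tokens (substitutions, deletions, insertions)."""
    # At the start of row i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, assuming plain whitespace tokenization (no normalization)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


def relative_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Percentage WER reduction relative to the baseline (assumed definition)."""
    if wer_before <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_before - wer_after) / wer_before


if __name__ == "__main__":
    # One substitution (cat -> bat) and one deletion (down) over 4 reference words.
    print(wer("the cat sat down", "the bat sat"))  # 0.5
    # Hypothetical baseline and finetuned WERs, not figures from the paper.
    print(relative_wer_reduction(0.5, 0.125))  # 75.0
```

Under this reading, an average improvement of 75.86% would mean that finetuned Whisper's WER is, on average, about a quarter of the baseline WER. If the authors instead report absolute percentage-point differences, the arithmetic changes accordingly.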

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis