AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

Lilian Wanzare; Cynthia Amol; Ezekiel Maina; Nelson Odhiambo; Hope Kerubo; Leila Misula; Vivian Oloo; Rennish Mboya; Edwin Onkoba; Edward Ombui; Joseph Muguro; Ciira wa Maina; Andrew Kipkebut; Alfred Omondi Otom; Ian Ndung'u Kang'ethe; Angela Wambui Kanyi; Brian Gichana Omwenga

arXiv:2604.08448 · cs.CL · April 10, 2026

TL;DR

AfriVoices-KE is a multilingual speech dataset for five Kenyan languages (Dholuo, Kikuyu, Kalenjin, Maasai, and Somali). It provides approximately 3,000 hours of audio from 4,777 native speakers and is intended to support speech technology development and linguistic preservation.

Contribution

This work introduces a large-scale, multilingual speech dataset for Kenyan languages, combining scripted and spontaneous speech collected via mobile apps with quality assurance measures.

Findings

01 Collected approximately 3,000 hours of speech data from 4,777 native speakers.

02 Included both scripted speech (750 hours, spanning eleven domains) and spontaneous speech (2,250 hours).

03 Addressed challenges of data collection in low-resource settings.

Abstract

AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors…
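The headline figures in the abstract can be cross-checked with a little arithmetic. The sketch below uses only the totals quoted above; the mean recording time per speaker is a derived average, not a number reported in the abstract, and it says nothing about how hours are actually distributed across speakers or languages. The helper name `share` is illustrative.

```python
# Totals as quoted in the abstract (all described there as approximate).
TOTAL_HOURS = 3000
SCRIPTED_HOURS = 750
SPONTANEOUS_HOURS = 2250
SPEAKERS = 4777


def share(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# The two speech types should add up to the stated total.
assert SCRIPTED_HOURS + SPONTANEOUS_HOURS == TOTAL_HOURS

scripted_pct = share(SCRIPTED_HOURS, TOTAL_HOURS)
spontaneous_pct = share(SPONTANEOUS_HOURS, TOTAL_HOURS)

# Derived average only (assumption: a simple mean over all speakers).
mean_minutes_per_speaker = TOTAL_HOURS * 60 / SPEAKERS

print(f"scripted: {scripted_pct:.0f}%, spontaneous: {spontaneous_pct:.0f}%")
print(f"mean audio per speaker: {mean_minutes_per_speaker:.1f} minutes")
```

On these numbers, spontaneous speech makes up three quarters of the dataset, and the average contribution works out to a little under 38 minutes per speaker.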

Code & Models

Models

Tonykip/whisper-kalenjin-v3-turbo (Hugging Face model, 388 downloads, 1 like)
