Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara

Yacouba Diarra; Panga Azazia Kamate; Nouhoum Souleymane Coulibaly; Michael Leventhal

arXiv:2512.19400 · cs.CL · December 23, 2025

TL;DR

This paper introduces Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives, and shows that finetuning Parakeet-based models on it lowers word error rate (WER) on two real-world test sets.

Contribution

The creation of Kunkado, a large-scale, realistic Bambara speech dataset, and the demonstration of improved ASR performance through finetuning on this data.

Findings

01

On one of the two real-world test sets, finetuning with Kunkado reduces WER from 44.47% to 37.12%.

02

On the other test set, finetuning with Kunkado reduces WER from 36.07% to 32.33%.

03

In human evaluation, the Kunkado-finetuned model outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech.

Abstract

We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes the code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
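
For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate is conventionally computed (word-level edit distance divided by reference length) and how the relative reductions implied by the reported figures can be derived. The function names `word_error_rate` and `relative_reduction` are illustrative, not taken from the paper, and the authors' actual evaluation pipeline, including their transcript normalization, may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); assumes whitespace tokenization
    and no text normalization, which the paper's pipeline may handle differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a WER value dropped, relative to the starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Placeholder tokens, not real Bambara transcripts.
    print(word_error_rate("a b c d", "a x c"))

    # WER figures reported in the abstract (percent).
    for before, after in [(44.47, 37.12), (36.07, 32.33)]:
        print(f"{before} -> {after}: {relative_reduction(before, after):.1%} relative")
```

Under this definition, the reported absolute drops of 7.35 and 3.74 points correspond to relative reductions of roughly 16.5% and 10.4%.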

Code & Models

Datasets

RobotsMali/kunkado
dataset · 1.2k downloads

Videos

Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara· underline

Taxonomy

Topics: Speech Recognition and Synthesis · ICT in Developing Communities · Phonetics and Phonology Research