Synthetic Voice Data for Automatic Speech Recognition in African Languages

Brian DeRenzi; Anna Dixon; Mohamed Aymane Farhi; Christian Resch

arXiv:2507.17578·cs.CL·November 10, 2025

Open Access

TL;DR

This paper systematically assesses the use of large-scale synthetic voice data generated via LLM-driven text creation and TTS for improving African language ASR, demonstrating cost-effective performance gains and highlighting the need for better evaluation protocols.

Contribution

It introduces a novel three-step process for creating synthetic African language speech data and evaluates its effectiveness in improving ASR performance across multiple low-resource languages.

Findings

01

Synthetic text scored above 5 out of 7 for readability in eight of the ten languages.

02

For Hausa, models fine-tuned on 250h real plus 250h synthetic data matched a 500h real-data-only baseline.

03

More than 2,500 hours of synthetic voice data were created at below 1% of the cost of real data; ASR gains for very low-resource languages varied.

Abstract

Speech technology remains out of reach for most of the over 2300 languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight out of ten languages for which we create synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. Fine-tuned Wav2Vec-BERT-2.0 models trained on 250h real and 250h synthetic Hausa matched a 500h real-data-only baseline, while 579h real and 450h to 993h synthetic data created the best performance. We also present gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved about 6.5%…
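
The abstract reports results as word error rate (WER). As a minimal sketch of how that metric is commonly computed, the example below takes the word-level edit distance between a reference transcript and an ASR hypothesis and divides it by the reference length. The function names and the sample strings are illustrative assumptions, not taken from the paper, and the paper's own scoring (for example, any text normalization) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (not from the paper); splits on whitespace and
    applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer (hypothetical helper)."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Placeholder transcripts: one substitution and one deletion over four words.
    print(word_error_rate("a b c d", "a x c"))
```

A WER change can be quoted either in absolute points or relative to a baseline, and `relative_improvement` covers only the relative case. The truncated abstract does not say which convention its Chichewa figure uses.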

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing