Synthetic Voice Data for Automatic Speech Recognition in African Languages
Brian DeRenzi, Anna Dixon, Mohamed Aymane Farhi, Christian Resch

TL;DR
This paper systematically assesses the use of large-scale synthetic voice data generated via LLM-driven text creation and TTS for improving African language ASR, demonstrating cost-effective performance gains and highlighting the need for better evaluation protocols.
Contribution
The paper applies a three-step process (LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning) to create synthetic African-language speech data, and evaluates its effect on ASR performance for three languages: Hausa, Dholuo, and Chichewa.
Findings
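Eight of the ten languages for which synthetic text was created achieved readability scores above 5 out of 7.
A Wav2Vec-BERT-2.0 model fine-tuned on 250h real plus 250h synthetic Hausa matched a 500h real-data-only baseline; the best performance came from 579h real combined with 450h to 993h synthetic data.
More than 2,500 hours of synthetic voice data were created at below 1% of the cost of real data, though gains for very low-resource languages varied.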
Abstract
Speech technology remains out of reach for most of the over 2300 languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight out of ten languages for which we create synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. Fine-tuned Wav2Vec-BERT-2.0 models trained on 250h real and 250h synthetic Hausa matched a 500h real-data-only baseline, while 579h real and 450h to 993h synthetic data created the best performance. We also present gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved about 6.5%…
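The abstract reports results as word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), the Python below may help; it is not the authors' evaluation code. The function names `word_error_rate` and `relative_improvement` and the example sentences are illustrative assumptions. The excerpt does not say whether the reported Chichewa improvement is absolute or relative, so the second helper shows only one possible reading.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline (hypothetical helper)."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Placeholder English text, not data from the paper.
    print(word_error_rate("the cat sat down", "the dog sat"))
```

In the usage line, one substitution and one deletion against a four-word reference give a WER of 0.5.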
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
