Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition
Ayman Mansour

TL;DR
This study explores data augmentation techniques to improve Sudanese dialect speech recognition, establishing a new benchmark and showing that effective models can be trained in a low-resource setting with low-cost resources.
Contribution
It introduces a combined self-training and TTS augmentation approach for Sudanese dialect ASR and provides the first benchmark for this low-resource dialect.
Findings
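Best model (Whisper-Medium with combined self-training and TTS augmentation) achieves 57.1% WER on the evaluation set and 51.6% WER on an out-of-domain holdout set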
Outperforms zero-shot multilingual Whisper and MSA models
Uses low-cost resources for effective model training
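The WER figures above use the standard definition: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal sketch follows, assuming plain whitespace tokenization; the function name `word_error_rate` and the example strings are illustrative, and any text normalization the paper applies before scoring is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization only
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.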
Abstract
Although many Automatic Speech Recognition (ASR) systems have been developed for Modern Standard Arabic (MSA) and Dialectal Arabic (DA), few studies have focused on dialect-specific implementations, particularly for low-resource Arabic dialects such as Sudanese. This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. Two augmentation strategies are investigated: (1) self-training with pseudo-labels generated from unlabeled speech, and (2) TTS-based augmentation using synthetic speech from the Klaam TTS system. The best-performing model, Whisper-Medium fine-tuned with combined self-training and TTS augmentation (28.4 hours), achieves a Word Error Rate (WER) of 57.1% on the evaluation set and 51.6% on an out-of-domain holdout set, substantially outperforming zero-shot…
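The two augmentation strategies in the abstract can be pictured as a data-assembly step: pseudo-labeled clips and TTS-generated clips are merged with the human-labeled data before fine-tuning. The sketch below is only an illustration under stated assumptions. The abstract does not say how pseudo-labels are selected, so the confidence threshold, the `transcribe` callback, the `Example` record, and all file names are hypothetical, and no real Whisper or Klaam API is called.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Example:
    audio_path: str
    text: str
    source: str  # "labeled", "pseudo", or "tts"


def pseudo_label(
    unlabeled_paths: Iterable[str],
    transcribe: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.8,  # assumption: threshold is not from the paper
) -> List[Example]:
    """Keep model transcripts of unlabeled audio that pass a confidence filter."""
    selected = []
    for path in unlabeled_paths:
        text, confidence = transcribe(path)  # hypothetical (text, score) callback
        text = text.strip()
        if text and confidence >= min_confidence:
            selected.append(Example(path, text, "pseudo"))
    return selected


def build_training_set(
    labeled: Iterable[Example],
    pseudo: Iterable[Example],
    synthetic: Iterable[Example],
) -> List[Example]:
    """Concatenate human-labeled, pseudo-labeled, and TTS-generated examples."""
    return list(labeled) + list(pseudo) + list(synthetic)


# Stub transcriber standing in for a fine-tuned ASR model (made-up outputs).
fake_outputs = {
    "clip_a.wav": ("placeholder transcript one", 0.93),
    "clip_b.wav": ("placeholder transcript two", 0.41),
    "clip_c.wav": ("   ", 0.99),
}

pseudo = pseudo_label(fake_outputs, lambda path: fake_outputs[path])
labeled = [Example("gold_1.wav", "human transcript", "labeled")]
synthetic = [Example("tts_1.wav", "text read by a TTS voice", "tts")]
train = build_training_set(labeled, pseudo, synthetic)
print([example.source for example in train])  # ['labeled', 'pseudo', 'tts']
```

Keeping a `source` tag on each example makes it easy to ablate one augmentation strategy at a time, which matches the way the abstract compares the strategies separately and in combination.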
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Linguistic Variation and Morphology
