Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning
Ömer Tarik Özyilmaz, Matt Coler, Matias Valdenegro-Toro

TL;DR
This paper fine-tunes OpenAI's Whisper on five Arabic dialects and shows that pooling dialectal data, when properly balanced, can help mitigate data scarcity while performing comparably to dialect-specific training.
Contribution
It demonstrates that dialect-pooled models can match dialect-specific models, reducing the need for large dialect-specific datasets in Arabic ASR.
Findings
Fine-tuning on small amounts of MSA data substantially improves smaller models, bringing them level with larger non-fine-tuned models.
Pre-training on MSA offers limited benefits for dialectal speech.
Dialect-pooled models perform comparably to dialect-specific models.
Abstract
Although commercial Arabic automatic speech recognition (ASR) systems support Modern Standard Arabic (MSA), they struggle with dialectal speech. We investigate the effect of fine-tuning OpenAI's Whisper on five major Arabic dialects (Gulf, Levantine, Iraqi, Egyptian, Maghrebi) using Mozilla Common Voice for MSA and the MASC dataset for dialectal speech. We evaluate MSA training size effects, benefits of pre-training on MSA data, and dialect-specific versus dialect-pooled models. We find that small amounts of MSA fine-tuning data yield substantial improvements for smaller models, matching larger non-fine-tuned models. While MSA pre-training shows minimal benefit, suggesting limited shared features between MSA and dialects, our dialect-pooled models perform comparably to dialect-specific ones. This indicates that pooling dialectal data, when properly balanced, can help address data…
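The abstract notes that pooling helps when the dialectal data is properly balanced, but the text here does not say how the balancing is done. As a minimal sketch, one simple option is to downsample every dialect to the size of the smallest one before mixing. The function name `pool_balanced`, the downsampling strategy, and the toy utterance IDs below are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict


def pool_balanced(samples, seed=0):
    """Pool (dialect, utterance) pairs so every dialect contributes equally.

    Assumption: balance by downsampling each dialect to the size of the
    smallest one. The paper may balance its pooled data differently.
    """
    by_dialect = defaultdict(list)
    for dialect, utterance in samples:
        by_dialect[dialect].append(utterance)

    # Number of utterances kept per dialect: the size of the smallest group.
    per_dialect = min(len(utts) for utts in by_dialect.values())

    rng = random.Random(seed)
    pooled = []
    for dialect in sorted(by_dialect):
        # Sample without replacement so no utterance is duplicated.
        for utterance in rng.sample(by_dialect[dialect], per_dialect):
            pooled.append((dialect, utterance))

    # Shuffle so training batches mix dialects.
    rng.shuffle(pooled)
    return pooled


# Hypothetical toy data: IDs stand in for audio/transcript pairs.
toy = (
    [("Gulf", f"gulf_{i}") for i in range(5)]
    + [("Levantine", f"lev_{i}") for i in range(3)]
    + [("Egyptian", f"egy_{i}") for i in range(8)]
)
pooled = pool_balanced(toy)
```

Downsampling discards data from the larger dialects; oversampling the smaller ones or weighting the loss are alternatives with different trade-offs.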
