Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning
Mahmoud Salhab, Shameed Sait, Mohammad Abusheikh, Hasan Abusheikh

TL;DR
This paper introduces a scalable training pipeline that combines weakly supervised pretraining with supervised fine-tuning to build a high-performing multi-dialectal Arabic speech recognition system despite limited labeled data.
Contribution
The paper leverages large-scale weakly labeled data and continual supervised fine-tuning to improve Arabic ASR across multiple dialects; the authors report state-of-the-art results.
Findings
Achieved first place in the NADI 2025 Shared Task 2 for multi-dialectal Arabic ASR.
Demonstrated effectiveness of weak supervision combined with fine-tuning for low-resource languages.
Produced a robust Arabic ASR model capable of handling diverse dialects.
Abstract
Automatic speech recognition (ASR) plays a vital role in enabling natural human-machine interaction across applications such as virtual assistants, industrial automation, customer support, and real-time transcription. However, developing accurate ASR systems for low-resource languages like Arabic remains a significant challenge due to limited labeled data and the linguistic complexity introduced by diverse dialects. In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. In the first stage, we pretrain the model on 15,000 hours of weakly labeled speech covering both Modern Standard Arabic (MSA) and various Dialectal Arabic (DA) variants. In the subsequent stage, we perform continual supervised fine-tuning using a mixture of filtered weakly labeled data and a small, high-quality…
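The second stage described above (filtering weakly labeled data, then mixing it with a small high-quality set) can be illustrated with a minimal sketch. The abstract as quoted here is truncated and does not state the filtering criterion, so everything concrete below is an assumption made for illustration: the filter keeps a weak label only if it agrees closely with a hypothesis from the stage-one model, measured by word error rate, and the names `word_error_rate`, `filter_weak_labels`, `mix_datasets`, `max_wer`, and `weak_fraction` are hypothetical rather than taken from the paper.

```python
import random


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def filter_weak_labels(weak_samples, hypotheses, max_wer=0.3):
    """Keep (audio_id, weak_label) pairs whose label is close to the model hypothesis.

    ASSUMPTION: agreement with a stage-one model is used as the quality signal;
    the paper's actual filter is not given in the truncated abstract.
    """
    return [
        (audio_id, label)
        for audio_id, label in weak_samples
        if word_error_rate(label, hypotheses.get(audio_id, "")) <= max_wer
    ]


def mix_datasets(filtered_weak, clean, weak_fraction=0.5, size=1000, seed=0):
    """Sample a fine-tuning set with a fixed share of filtered weak data.

    ASSUMPTION: sampling with replacement at a fixed ratio; the real mixing
    scheme is not specified in the source.
    """
    rng = random.Random(seed)
    n_weak = round(size * weak_fraction)
    mixed = rng.choices(filtered_weak, k=n_weak) + rng.choices(clean, k=size - n_weak)
    rng.shuffle(mixed)
    return mixed
```

Sampling with replacement lets the small high-quality set be upweighted relative to the much larger weak pool, which is one common way to keep clean data influential during fine-tuning; whether the authors do this is not stated in the text above.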
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · ICT in Developing Communities
