Ara-BEST-RQ: Multi Dialectal Arabic SSL
Haroun Elleuch, Ryan Whetten, Salima Mdhaffar, Yannick Estève, Fethi Bougares

TL;DR
Ara-BEST-RQ is a family of self-supervised models tailored to multi-dialectal Arabic speech. The models achieve state-of-the-art dialect identification and competitive speech recognition while using fewer parameters than competing models.
Contribution
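The paper presents a new SSL model family designed specifically for multi-dialectal Arabic, shows that it outperforms multilingual models and monolingual models trained on non-Arabic data on downstream tasks, and commits to releasing models, code, and pre-processed datasets for future research.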
Findings
State-of-the-art dialect identification accuracy
Effective speech recognition with fewer parameters
Significant performance gains from dialect-targeted pre-training
Abstract
We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.
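For readers unfamiliar with BEST-RQ, the pre-training targets come from a frozen random-projection quantizer: each speech frame is projected with a fixed random matrix and mapped to the index of its nearest entry in a fixed random codebook, and the encoder is trained to predict those indices for masked frames. The following is a minimal sketch of that labeling step only. The function names, dimensions, codebook size, and initialization are illustrative assumptions, not the configuration used for Ara-BEST-RQ, which the abstract does not specify.

```python
import numpy as np

def make_quantizer(feat_dim, proj_dim=16, codebook_size=8192, seed=0):
    """Create a frozen random projection and codebook (never trained).

    All sizes here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((feat_dim, proj_dim))
    codebook = rng.standard_normal((codebook_size, proj_dim))
    # Unit-normalize codebook rows so nearest-neighbor search reduces
    # to picking the largest dot product.
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
    return projection, codebook

def quantize(features, projection, codebook):
    """Map each frame (row of `features`) to a discrete target label."""
    projected = features @ projection
    projected = projected / (np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8)
    # For unit vectors, the closest codebook entry has the highest cosine similarity.
    return np.argmax(projected @ codebook.T, axis=1)

if __name__ == "__main__":
    # Stand-in for 100 frames of 80-dimensional acoustic features.
    frames = np.random.default_rng(1).standard_normal((100, 80))
    projection, codebook = make_quantizer(feat_dim=80)
    labels = quantize(frames, projection, codebook)
    print(labels.shape)
```

Because the projection and codebook stay fixed, the targets require no separate quantizer training; only the encoder that predicts them is learned.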
Taxonomy
Topics: Speech Recognition and Synthesis · Authorship Attribution and Profiling · Natural Language Processing Techniques
