ADI-20: Arabic Dialect Identification dataset and models
Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares

TL;DR
This paper introduces ADI-20, an Arabic dialect identification dataset with 3,556 hours of speech covering 19 Arabic dialects plus Modern Standard Arabic (MSA). It evaluates state-of-the-art dialect identification models and analyzes how training data size and model parameter count affect performance.
Contribution
The paper extends the earlier ADI-17 dataset into ADI-20 and evaluates several dialect identification systems on it, including fine-tuned ECAPA-TDNN models and Whisper-encoder-based classifiers.
Findings
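Training on only 30% of the original training data causes a small decrease in F1 score
Identification performance is affected by training data size and model parameter count
The collected data and trained models are open-sourced to support reproduction and further research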
Abstract
We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.
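The Whisper-based system in the abstract combines encoder blocks with an attention pooling layer and a dense classification layer. The following is a minimal sketch of that pooling-plus-classification head only, assuming a simple single-vector attention (one learned score per frame). The abstract does not specify the exact pooling formulation, so the function names, the random stand-in features, and the dimensions `T` and `D` are all hypothetical; real inputs would be frame-level outputs of the Whisper encoder.

```python
import math
import random

NUM_CLASSES = 20  # 19 dialects + MSA, as stated in the abstract


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Collapse T frame vectors (each of size D) into one vector of size D.

    Assumed form: score_t = w . h_t, weights = softmax over time,
    output = weighted sum of frames. Returns (pooled, weights).
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas


def dense(x, weights, bias):
    """Linear layer: one output per row of `weights`."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classify(frames, w_att, w_out, b_out):
    """Attention pooling followed by a dense layer and softmax over dialects."""
    pooled, _ = attention_pool(frames, w_att)
    return softmax(dense(pooled, w_out, b_out))


# Random stand-ins for encoder outputs and (untrained) parameters.
rng = random.Random(0)
T, D = 50, 8  # hypothetical number of frames and feature size
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
w_att = [rng.gauss(0, 1) for _ in range(D)]
w_out = [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(NUM_CLASSES)]
b_out = [0.0] * NUM_CLASSES

probs = classify(frames, w_att, w_out, b_out)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Pooling over time lets utterances of any length map to a fixed-size vector before classification. With an all-zero attention vector the weights become uniform and the pooling reduces to a plain average over frames.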
Taxonomy
Topics: Authorship Attribution and Profiling · Linguistic Variation and Morphology · Speech Recognition and Synthesis
