ADI-20: Arabic Dialect Identification dataset and models

Haroun Elleuch; Salima Mdhaffar; Yannick Estève; Fethi Bougares

arXiv:2511.10070 · cs.CL · November 14, 2025

TL;DR

This paper introduces ADI-20, a 3,556-hour Arabic dialect identification dataset covering 19 Arabic dialects plus Modern Standard Arabic (MSA). It evaluates fine-tuned ECAPA-TDNN and Whisper-based models on the dataset and analyzes how training data size and model parameter count affect identification performance.

Contribution

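The paper releases ADI-20, an extension of the earlier ADI-17 dataset, and uses it to train and evaluate several dialect identification systems: fine-tuned pre-trained ECAPA-TDNN models, and Whisper encoder blocks coupled with an attention pooling layer and a dense classification layer.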

Findings

01

Training on only 30% of the original training data yields only a small decrease in F1 score

02

The study measures how training data size and model parameter count affect identification performance

03

The collected data and trained models are open-sourced to support reproduction and further research
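As a rough illustration of the reduced-data setting in finding 01, the sketch below draws a 30% training subset while keeping every dialect represented. The summary does not say how the authors built their subset, so stratifying by dialect label and sampling by utterance count (rather than by hours) are assumptions made here, and the manifest format and the `EGY`, `TUN`, `MSA` labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(items, fraction, seed=0):
    """Keep roughly `fraction` of the utterances of each dialect label.

    `items` is a list of (utterance_id, dialect_label) pairs.
    This is an assumed procedure, not the one used in the paper.
    """
    by_label = defaultdict(list)
    for utt_id, label in items:
        by_label[label].append(utt_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = []
    for label in sorted(by_label):
        utt_ids = by_label[label]
        # Keep at least one utterance so that no dialect disappears.
        n_keep = max(1, round(len(utt_ids) * fraction))
        kept.extend((utt_id, label) for utt_id in rng.sample(utt_ids, n_keep))
    return kept


# Hypothetical toy manifest: three labels with 100, 50, and 10 utterances.
manifest = (
    [(f"egy_{i}", "EGY") for i in range(100)]
    + [(f"tun_{i}", "TUN") for i in range(50)]
    + [(f"msa_{i}", "MSA") for i in range(10)]
)
subset = stratified_subsample(manifest, fraction=0.3)
```

Sampling per label keeps the class balance of the subset close to that of the full set, which makes F1 scores from the two training runs easier to compare.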

Abstract

We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.
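The Whisper-based system in the abstract can be pictured as three stages: encoder frames, attention pooling, and a dense classifier. The following is a minimal sketch of the last two stages in plain Python, assuming a common single-vector form of attention pooling (dot-product scores followed by a softmax); the abstract does not specify the exact pooling variant. Random numbers stand in for real Whisper encoder outputs and trained weights, and every name and size other than the 20 classes is hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T frame vectors (each of size D) into one D-dim vector.

    Each frame is scored by its dot product with a learned vector `query`;
    the softmax of those scores gives the weights of a weighted average.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


def classify(pooled, weight_rows, biases):
    """Dense layer followed by softmax: one probability per class."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight_rows, biases)
    ]
    return softmax(logits)


NUM_CLASSES = 20        # 19 dialects + MSA, as stated in the abstract
NUM_FRAMES, DIM = 50, 8  # toy sizes, assumptions for this sketch only

rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
query = [rng.gauss(0, 1) for _ in range(DIM)]
weight_rows = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

pooled, weights = attention_pool(frames, query)
probs = classify(pooled, weight_rows, biases)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Pooling turns a variable-length sequence of frames into a fixed-size vector, so the same dense layer can classify utterances of any duration.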

Taxonomy

Topics: Authorship Attribution and Profiling · Linguistic Variation and Morphology · Speech Recognition and Synthesis