ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging

Omer Nacar; Serry Sibaee; Adel Ammar; Yasser Alhabashi; Nadia Samer Sibai; Yara Farouk Ahmed; Ahmed Saud Alqusaiyer; Sulieman Mahmoud AlMahmoud; Abdulrhman Mamdoh Mukhaniq; Lubaba Raed; Sulaiman Mohammed Alatwah; Waad Nasser Alqahtani; Yousif Abdulmajeed Alnasser; Mohamed Aziz Khadraoui; Wadii Boulila

arXiv:2601.02209·cs.CL·January 6, 2026

TL;DR

ARCADE is a city-level Arabic dialect speech dataset built from radio streams. It is intended to support fine-grained dialect identification and multi-task learning on the accompanying annotations.

Contribution

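This paper introduces ARCADE, presented as the first Arabic speech corpus designed explicitly with city-level dialect granularity. Each clip carries reviewer-assigned annotations, which addresses a gap in mapping speech to fine-grained dialect sources and supports fine-grained dialect tagging.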

Findings

01. Dataset includes 3,790 audio segments from 58 cities.

02. Annotations cover dialect, emotion, speech type, and validity.

03. Supports multi-task learning for dialect identification.
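
The annotation fields listed above suggest a per-clip record along the following lines. This is a minimal sketch only: the class name `ClipAnnotation`, its field names, the helper `clips_per_city`, and the placeholder label values are all assumptions for illustration and are not taken from the dataset's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipAnnotation:
    """Hypothetical per-clip record; field names are illustrative, not ARCADE's schema."""
    city: str         # city associated with the radio stream
    dialect: str      # dialect label assigned by a reviewer
    emotion: str      # emotion label
    speech_type: str  # speech-type label
    valid: bool       # whether the clip was judged usable


def clips_per_city(clips):
    """Count valid clips per city, e.g. to inspect class balance before training."""
    return Counter(c.city for c in clips if c.valid)


# Placeholder values only; none of these labels come from the corpus.
clips = [
    ClipAnnotation("city_a", "dialect_x", "emotion_1", "type_1", True),
    ClipAnnotation("city_a", "dialect_y", "emotion_1", "type_2", True),
    ClipAnnotation("city_b", "dialect_z", "emotion_2", "type_1", False),
]
counts = clips_per_city(clips)
```

Keeping all labels on one record is one way to feed several prediction heads (dialect, emotion, speech type) from the same clip in a multi-task setup. Filtering on the validity flag first keeps rejected clips out of every task.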

Abstract

The Arabic language is characterized by a rich tapestry of regional dialects that differ substantially in phonetics and lexicon, reflecting the geographic and cultural diversity of its speakers. Despite the availability of many multi-dialect datasets, mapping speech to fine-grained dialect sources, such as cities, remains underexplored. We present ARCADE (Arabic Radio Corpus for Audio Dialect Evaluation), the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus comprises Arabic radio speech collected from streaming services across the Arab world. Our data pipeline captures 30-second segments from verified radio streams, encompassing both Modern Standard Arabic (MSA) and diverse dialectal speech. To ensure reliability, each clip was annotated by one to three native Arabic reviewers who assigned rich metadata, including emotion, speech type,…
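
As a rough illustration of the 30-second segmentation mentioned in the abstract, the sketch below computes segment boundaries for a recording of known length. The function name `segment_bounds` and the choice to drop a trailing partial segment are assumptions; the source does not say how the authors' pipeline handles stream boundaries.

```python
SEGMENT_SECONDS = 30  # segment length stated in the abstract


def segment_bounds(total_seconds, segment_seconds=SEGMENT_SECONDS):
    """Return (start, end) offsets in seconds for each full-length segment.

    A trailing partial segment is dropped; this is an assumption, not a
    documented property of the ARCADE pipeline.
    """
    n_full = int(total_seconds // segment_seconds)
    return [(i * segment_seconds, (i + 1) * segment_seconds) for i in range(n_full)]


# A 95-second recording yields three full segments; the last 5 seconds are dropped.
bounds = segment_bounds(95)
```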

Code & Models

Datasets

riotu-lab/ARCADE-full
dataset · 78 dl

Taxonomy

Topics: Linguistic Variation and Morphology · Speech Recognition and Synthesis · Authorship Attribution and Profiling