Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

Abdellah El Mekki; Samar M. Magdy; Houdaifa Atou; Ruwa AbuHweidi; Baraah Qawasmeh; Omer Nacar; Thikra Al-hibiri; Razan Saadie; Hamzah Alsayadi; Nadia Ghezaiel Hammouda; Alshima Alkhazimi; Aya Hamod; Al-Yas Al-Ghafri; Wesam El-Sayed; Asila Al sharji; Mohamad Ballout; Anas Belfathi; Karim Ghaddar; Serry Sibaee; Alaa Aoun; Areej Asiri; Lina Abureesh; Ahlam Bashiti; Majdal Yousef; Abdulaziz Hafiz; Yehdih Mohamed; Emira Hamedtou; Brakehe Brahim; Rahaf Alhamouri; Youssef Nafea; Aya El Aatar; Walid Al-Dhabyani; Emhemed Hamed; Sara Shatnawi; Fakhraddin Alwajih; Khalid Elkhidir; Ashwag Alasmari; Abdurrahman Gerrio; Omar Alshahri; AbdelRahim A. Elmadany; Ismail Berrada; Amir Azad Adli Alkathiri; Fadi A Zaraket; Mustafa Jarrar; Yahya Mohamed El Hadj; Hassan Alhuzali; Muhammad Abdul-Mageed

arXiv:2601.13099 · cs.CL · April 21, 2026

TL;DR

Alexandria is a comprehensive, multi-domain dialectal Arabic dataset designed to improve machine translation and LLM performance across diverse Arabic dialects with detailed metadata and evaluation benchmarks.

Contribution

The paper introduces a large-scale, community-driven dataset with city-level dialect metadata and gender annotations, filling a gap in resources for dialectal Arabic translation.

Findings

01. Current LLMs still struggle to translate diverse Arabic dialects.

02. The dataset enables detailed evaluation of dialectal and gender-conditioned translation performance.

03. Benchmark results highlight persistent gaps in dialectal Arabic machine translation capabilities.
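As a rough illustration of what evaluation broken down by dialect and gender could look like, the sketch below averages per-example translation scores by city and speaker-addressee gender configuration. The field names (`city`, `gender_config`, `score`) and the input values are hypothetical placeholders, not the paper's schema or results.

```python
from collections import defaultdict
from statistics import mean


def scores_by_group(rows):
    """Average a per-example score for each (city, gender_config) pair.

    `rows` is a list of dicts with hypothetical keys:
    "city", "gender_config", and "score".
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["city"], row["gender_config"])].append(row["score"])
    return {key: mean(values) for key, values in groups.items()}


# Placeholder inputs for illustration only; these are not reported results.
rows = [
    {"city": "<city A>", "gender_config": "female-to-male", "score": 0.25},
    {"city": "<city A>", "gender_config": "female-to-male", "score": 0.75},
    {"city": "<city B>", "gender_config": "male-to-female", "score": 0.5},
]
grouped = scores_by_group(rows)
```

Grouping on the pair rather than on city alone keeps gender-conditioned differences visible within a single local variety.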

Abstract

Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned…
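To make the structure described in the abstract concrete, here is a minimal sketch of how one parallel conversational scenario might be represented. Every class and field name is an assumption made for illustration; the released dataset's actual schema may differ, and the text values are placeholders rather than real dataset content.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Turn:
    """One conversational turn with its parallel translation (hypothetical fields)."""
    speaker: str
    english: str
    dialectal_arabic: str


@dataclass
class Scenario:
    """A multi-turn scenario with the metadata the abstract mentions (hypothetical fields)."""
    country: str
    city: str              # city-of-origin metadata
    domain: str            # e.g. "health", "education", "agriculture"
    speaker_gender: str
    addressee_gender: str
    turns: List[Turn] = field(default_factory=list)

    def parallel_pairs(self) -> List[Tuple[str, str]]:
        """Return (English, dialectal Arabic) pairs in turn order."""
        return [(t.english, t.dialectal_arabic) for t in self.turns]


# Placeholder values only; not taken from the dataset.
example = Scenario(
    country="<country>",
    city="<city>",
    domain="health",
    speaker_gender="female",
    addressee_gender="male",
    turns=[Turn("A", "<English turn>", "<dialectal Arabic turn>")],
)
pairs = example.parallel_pairs()
```

Keeping the gender configuration and city at the scenario level, rather than per turn, reflects the abstract's description of annotations attached to whole conversational scenarios; that placement is itself an assumption.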

Code & Models

Repositories

UBC-NLP/Alexandria (GitHub)

Datasets

UBC-NLP/alexandria
dataset· 1.8k dl