Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues

Muhammad Dehan Al Kautsar; Saeed Almheiri; Momina Ahsan; Bilal Elbouardi; Younes Samih; Sarfraz Ahmad; Amr Keleg; Omar El Herraoui; Kareem Elzeky; Abed Alhakim Freihat; Mohamed Anwar; Zhuohan Xie; Junhong Liang; Mohammad Rustom Al Nasar; Preslav Nakov; Fajri Koto

arXiv:2605.00119·cs.CL·May 4, 2026

TL;DR

This paper introduces ArabCulture-Dialogue, a culturally grounded conversational dataset for evaluating cultural reasoning in LLMs across Modern Standard Arabic (MSA) and Arabic dialects, and shows that models still perform worse on dialects than on MSA.

Contribution

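The paper presents a new culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country's dialect, and uses it to define three benchmarking tasks: multiple-choice cultural reasoning, machine translation between MSA and dialects, and dialect-steering generation.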

Findings

01. Models perform worse on dialects than on MSA across the benchmarking tasks.

02. The dataset spans 12 daily-life topics and 54 fine-grained subtopics.

03. Benchmarking shows that the performance gap between MSA and Arabic dialects still exists.

Abstract

There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country's respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We utilize the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects still exists, whereby the models perform worse on all three tasks in the…
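As a rough illustration of how the MSA-versus-dialect gap on the multiple-choice task could be measured, the sketch below computes accuracy per language variety and takes the difference. It is not the authors' evaluation code: the record fields (`variety`, `gold`, `pred`), the function names, and the toy records are all hypothetical and are not taken from the dataset or the paper's results.

```python
from collections import defaultdict


def accuracy_by_variety(records):
    """Return accuracy per language variety.

    `records` is an iterable of dicts with hypothetical keys:
    'variety' (e.g. "msa" or "dialect"), 'gold' (correct option),
    and 'pred' (the option the model chose).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["variety"]] += 1
        if record["pred"] == record["gold"]:
            correct[record["variety"]] += 1
    return {variety: correct[variety] / total[variety] for variety in total}


def msa_dialect_gap(records):
    """Accuracy on MSA items minus accuracy on dialect items."""
    accuracy = accuracy_by_variety(records)
    return accuracy["msa"] - accuracy["dialect"]


# Made-up toy predictions, only to show the calling convention.
toy_records = [
    {"variety": "msa", "gold": "A", "pred": "A"},
    {"variety": "msa", "gold": "B", "pred": "B"},
    {"variety": "dialect", "gold": "A", "pred": "A"},
    {"variety": "dialect", "gold": "B", "pred": "C"},
]

print(accuracy_by_variety(toy_records))
print(msa_dialect_gap(toy_records))
```

A positive gap means the model answered more MSA items correctly than dialect items. A per-country breakdown would follow the same pattern with the country's dialect as the variety key.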
