From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

Serry Sibaee; Omer Nacar; Adel Ammar; Yasser Al-Habashi; Abdulrahman Al-Batati; Wadii Boulila

arXiv:2506.01920·cs.CL·June 3, 2025

TL;DR

This paper introduces a comprehensive evaluation framework for Arabic language models, together with a new dataset. The evaluation reveals performance gaps across domains, particularly where deep cultural understanding is required.

Contribution

It establishes theoretical guidelines and presents the Arabic Depth Mini Dataset (ADMD) for more accurate Arabic LLM evaluation.

Findings

01. Claude 3.5 Sonnet achieved 30% accuracy overall

02. Significant performance variation across domains

03. Challenges remain in cultural and specialized knowledge areas

Abstract

This paper addresses critical gaps in Arabic language model evaluation by establishing comprehensive theoretical guidelines and introducing a novel evaluation framework. We first analyze existing Arabic evaluation datasets, identifying significant issues in linguistic accuracy, cultural alignment, and methodological rigor. To address these limitations in LLMs, we present the Arabic Depth Mini Dataset (ADMD), a carefully curated collection of 490 challenging questions spanning ten major domains (42 sub-domains; see Figure 1). Using ADMD, we evaluate five leading language models: GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max. Our results reveal significant variations in model performance across different domains, with particular challenges in areas requiring deep cultural understanding and specialized knowledge. Claude 3.5 Sonnet demonstrated the highest overall…
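The per-domain comparison described in the abstract can be illustrated with a minimal sketch. It assumes each graded answer is recorded as a (domain, is_correct) pair; the function names, the record layout, and the toy domain labels are hypothetical and are not taken from the paper or from ADMD.

```python
from collections import defaultdict


def accuracy_by_domain(results):
    """Map each domain to the fraction of its questions answered correctly.

    `results` is an iterable of (domain, is_correct) pairs. This layout is
    an assumption made for illustration, not the paper's data format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in results:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    return {domain: correct[domain] / totals[domain] for domain in totals}


def overall_accuracy(results):
    """Fraction of all questions answered correctly, across every domain."""
    results = list(results)
    if not results:
        return 0.0
    return sum(int(ok) for _, ok in results) / len(results)


# Made-up toy records (hypothetical domain labels, not ADMD data).
toy_results = [
    ("domain_a", True),
    ("domain_a", False),
    ("domain_b", False),
    ("domain_b", False),
]

per_domain = accuracy_by_domain(toy_results)
overall = overall_accuracy(toy_results)
```

Reporting the per-domain breakdown alongside the overall score is what makes domain-level variation visible; a single aggregate number would hide it.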

Taxonomy

Topics: Natural Language Processing Techniques