AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Mohammad Zbeeb; Hasan Abed Al Kader Hammoud; Sina Mukalled; Nadine Rizk; Fatima Karnib; Issam Lakkis; Ammar Mohanna; Bernard Ghanem

arXiv:2511.14295 · cs.CL · December 10, 2025

TL;DR

AraLingBench is a human-annotated benchmark designed to evaluate Arabic language models across core linguistic skills, revealing that current models excel at surface tasks but lack deep grammatical and syntactic understanding.

Contribution

The paper introduces AraLingBench, the first comprehensive human-annotated benchmark for Arabic linguistic capabilities of large language models, focusing on fundamental language skills.

Findings

01. Models perform well on knowledge-based tasks but poorly on grammatical and syntactic reasoning.

02. Many models succeed through memorization or pattern recognition rather than genuine language understanding.

03. AraLingBench exposes gaps in the linguistic mastery of Arabic LLMs.

Abstract

We present AraLingBench: a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories (grammar, morphology, spelling, reading comprehension, and syntax) through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs.…
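Because the abstract reports results per linguistic category, a per-category accuracy breakdown is the natural way to read a multiple-choice benchmark of this kind. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the field names `category` and `answer`, the option labels, and the toy items are all hypothetical and are not taken from the released dataset.

```python
from collections import defaultdict


def per_category_accuracy(items, predictions):
    """Return {category: accuracy} for multiple-choice items.

    items: list of dicts with hypothetical keys "category" and "answer".
    predictions: list of predicted option labels, aligned with items.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        category = item["category"]
        total[category] += 1
        if predicted == item["answer"]:
            correct[category] += 1

    # Accuracy is the fraction of correctly answered questions per category.
    return {category: correct[category] / total[category] for category in total}


# Toy data for illustration only (not real benchmark questions).
items = [
    {"category": "grammar", "answer": "A"},
    {"category": "grammar", "answer": "C"},
    {"category": "syntax", "answer": "B"},
    {"category": "spelling", "answer": "D"},
]
predictions = ["A", "B", "B", "D"]

scores = per_category_accuracy(items, predictions)
for category, accuracy in sorted(scores.items()):
    print(f"{category}: {accuracy:.2f}")
```

Reporting one score per category rather than a single aggregate keeps a weakness in, say, syntax from being masked by a strong result elsewhere, which is the kind of gap the abstract describes.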

Code & Models

Datasets

hammh0a/AraLingBench
dataset · 26 dl

Videos

AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification