BALSAM: A Platform for Benchmarking Arabic Large Language Models

Rawan Al-Matham; Kareem Darwish; Raghad Al-Rasheed; Waad Alshammari; Muneera Alhoshan; Amal Almazrua; Asma Al Wazrah; Mais Alheraki; Firoj Alam; Preslav Nakov; Norah Alzahrani; Eman alBilali; Nizar Habash; Abdelrahman El-Sheikh; Muhammad Elmallah; Haonan Li; Hamdy Mubarak; Mohamed Anwar; Zaid Alyafeai; Ahmed Abdelali; Nora Altwairesh; Maram Hasanain; Abdulmohsen Al Thubaity; Shady Shehata; Bashar Alhafni; Injy Hamed; Go Inoue; Khalid Elmadani; Ossama Obeid; Fatima Haouari; Tamer Elsayed; Emad Alghamdi; Khalid Almubarak; Saied Alshahrani; Ola Aljarrah; Safa Alajlan; Areej Alshaqarawi; Maryam Alshihri; Sultana Alghurabi; Atikah Alzeghayer; Afrah Altamimi; Abdullah Alfaifi; Abdulrahman AlOsaimy

arXiv:2507.22603 · cs.CL · July 31, 2025

TL;DR

BALSAM is a comprehensive, community-driven benchmarking platform designed to evaluate and advance Arabic Large Language Models across diverse NLP tasks with transparent, blind testing.

Contribution

BALSAM introduces a large, diverse Arabic benchmark of 78 tasks and a centralized platform for unbiased evaluation, addressing limitations of earlier Arabic NLP benchmarks.

Findings

01. Provides 78 NLP tasks across 14 broad categories

02. Includes 52K examples (37K test, 15K development) with blind test sets

03. Establishes a centralized, transparent platform for Arabic LLM evaluation

Abstract

The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform…
