MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark

Mouath Abu-Daoud; Leen Kharouf; Omar El Hajj; Dana El Samad; Mariam Al-Omari; Jihad Mallat; Khaled Saleh; Nizar Habash; Farah E. Shamout

arXiv:2602.01714·cs.CL·February 3, 2026

Open Access · 3 Reviews

TL;DR

MedAraBench is a comprehensive Arabic medical question-answering dataset that enables evaluation and development of multilingual LLMs in healthcare, addressing a critical resource gap for Arabic NLP in medicine.

Contribution

The paper introduces MedAraBench, a large-scale, high-quality Arabic medical dataset with diverse specialties, and provides benchmark evaluations of current models to advance Arabic medical NLP research.

Findings

01

Existing models show room for domain-specific improvements.

02

The dataset covers 19 specialties and five difficulty levels.

03

Benchmark results highlight the need for further model enhancements.

Abstract

Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our…
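A benchmark of multiple-choice questions organized by specialty, as the abstract describes, is typically scored by accuracy within each group. The following is a minimal sketch of that scoring step, not the authors' evaluation code. The field names `specialty`, `difficulty`, `answer`, and `prediction` are hypothetical and are not taken from the MedAraBench schema, and the records are toy values invented for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key="specialty"):
    """Return {group: accuracy} for multiple-choice predictions.

    Each record is a dict holding the gold option under "answer" and the
    model's chosen option under "prediction" (hypothetical field names).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        total[group] += 1
        if rec["prediction"] == rec["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records, invented for illustration only.
records = [
    {"specialty": "cardiology", "difficulty": 1, "answer": "A", "prediction": "A"},
    {"specialty": "cardiology", "difficulty": 3, "answer": "C", "prediction": "B"},
    {"specialty": "pediatrics", "difficulty": 2, "answer": "D", "prediction": "D"},
]

print(accuracy_by_group(records))
print(accuracy_by_group(records, group_key="difficulty"))
```

Grouping by a key rather than reporting one overall number is what allows per-specialty and per-difficulty comparison; the same function handles both by switching `group_key`.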

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The paper presents a useful resource for Arabic medical understanding, and benchmarking results for a number of SOTA models.

Weaknesses

Some details about the construction of the dataset and the implementation of the benchmarking are missing; they are listed in the Questions section of the review.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The manual digitization and expert validation of data from non-digital academic sources show significant effort and ensure the dataset's authenticity and reliability.
- The dataset spans 19 medical specialties at various difficulty levels, offering a structured framework that supports fine-grained evaluation of LLM performance across domains of medical knowledge in Arabic.

Weaknesses

- To validate data quality with LLM-as-a-judge, the authors employ GPT-4, Gemini 1.5 Pro, and Claude 3.5 Sonnet, but they give no justification for selecting these specific models. Are they known to outperform others in Arabic understanding? Moreover, the prompt instructs the models to act as a medical education expert but does not account for the Arabic-language aspect of the task. The capability of these models in Arabic medical understanding needs further evaluation.
- The lite

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Significant manual data cleaning and digitization effort, which adds credibility and quality to the dataset.
2. Diverse specialty coverage and structured annotation across difficulty levels, ensuring representativeness within the medical domain.
3. Inclusion of a human expert evaluation component, which is commendable and adds qualitative depth to the study.
4. Contributes to Arabic NLP, a domain with limited existing benchmarks and resources.

Weaknesses

1. Unjustified selection of evaluator LLMs (Section 3.2.2). The paper provides no justification for the selection of the three LLMs used as evaluators in the LLM-as-a-judge setup. There is no discussion of why these particular models were chosen, nor any rationale for excluding medical or Arabic-specialized LLMs such as BiMediX (Arabic + medical), Fanar (Arabic), or MedGemma (medical + multilingual). A more rigorous approach would have been to compare multiple candidate evaluators and measure t

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare