3LM: Bridging Arabic, STEM, and Code through Benchmarking

Basma El Amel Boussaha; Leen AlQadi; Mugariya Farooq; Shaikha Alsuwaidi; Giulia Campesan; Ahmed Alzubaidi; Mohammed Alyafeai; Hakim Hacid

arXiv:2507.15850·cs.CL·July 28, 2025

TL;DR

This paper introduces 3LM, a suite of three benchmarks for evaluating Arabic language models on STEM and code, two domains where Arabic evaluation resources are scarce.

Contribution

The paper presents the first dedicated Arabic benchmarks for STEM and code, including question-answer pairs, synthetic questions, and translated code benchmarks, to advance Arabic LLM research.

Findings

01 Benchmarks publicly released for community use
02 Supports evaluation of Arabic LLMs in STEM and coding
03 Addresses a significant gap in Arabic NLP resources

Abstract

Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code, which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure…

Code & Models

Datasets

we-z/Arabic-STEM-MCQ
dataset · 12 downloads

Videos

3LM: Bridging Arabic, STEM, and Code through Benchmarking · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification