Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

Mohammed Alkhowaiter; Norah Alshahrani; Saied Alshahrani; Reem I. Masoud; Alaa Alzahrani; Deema Alnuhait; Emad A. Alghamdi; Khalid Almubarak

arXiv:2507.14688·cs.CL·October 1, 2025

TL;DR

This paper reviews Arabic post-training datasets for LLMs, highlighting their limitations in diversity, documentation, and adoption, and discusses implications for future Arabic NLP development.

Contribution

The paper evaluates existing Arabic post-training datasets, identifies critical gaps, and offers recommendations for improving dataset quality and diversity.

Findings

01 Limited task diversity in datasets
02 Inconsistent documentation and annotation
03 Low community adoption

Abstract

Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., Persona and System Prompts); (3) Alignment (e.g., Cultural, Safety, Ethics, and Fairness); and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education