ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

Mohammed Khalil; Mohammed Sabry

arXiv:2407.19835 · cs.CL · September 8, 2025

Open Access · 1 Dataset · 1 Video

TL;DR

The paper introduces ATHAR, a large, high-quality dataset of 66,000 Classical Arabic-to-English translation samples covering diverse topics. It aims to improve translation systems and to support the dissemination of cultural and scientific knowledge.

Contribution

It provides a new, extensive dataset for Classical Arabic-English translation and evaluates how current language models can benefit from it.

Findings

01 Models improve with fine-tuning on ATHAR.

02 Current LLMs underperform without specialized datasets.

03 The dataset enhances translation quality across various topics.

Abstract

Classical Arabic represents a significant era that encompasses the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, which comprises 66,000 high-quality classical Arabic to English translation samples that cover a wide array of topics including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a…

Code & Models

Datasets

mohamed-khalil/ATHAR
dataset · 146 downloads

Videos

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques