ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation
Mohammed Khalil, Mohammed Sabry

TL;DR
The paper introduces ATHAR, a dataset of 66,000 high-quality Classical Arabic to English translation samples covering diverse topics, aimed at improving translation systems and supporting the dissemination of cultural and scientific knowledge.
Contribution
It provides a new, extensive dataset for Classical Arabic-English translation and evaluates how current language models can benefit from it.
Findings
Models improve with fine-tuning on ATHAR.
Current LLMs underperform without specialized datasets.
The dataset enhances translation quality across various topics.
Abstract
Classical Arabic represents a significant era that encompasses the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, which comprises 66,000 high-quality Classical Arabic to English translation samples that cover a wide array of topics including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques
