ASCAT: An Arabic Scientific Corpus and Benchmark for Advanced Translation Evaluation
Serry Sibaee, Khloud Al Jallad, Zineb Yousfi, Israa Elsayed Elhosiny, Yousra El-Ghawi, Batool Balah, Omer Nacar

TL;DR
ASCAT is a high-quality Arabic-English scientific corpus created through multi-engine translation and expert validation, serving as a benchmark for evaluating and training scientific translation models.
Contribution
It introduces a systematic pipeline for constructing a large, validated scientific translation corpus for Arabic-English, covering multiple scientific domains.
Findings
Benchmarking three state-of-the-art LLMs on ASCAT shows varied translation quality.
The corpus contains over 67,000 English tokens and 60,000 Arabic tokens.
ASCAT effectively evaluates scientific translation models for Arabic.
Abstract
We present ASCAT (Arabic Scientific Corpus for Advanced Translation), a high-quality English-Arabic parallel benchmark corpus designed for scientific translation evaluation, constructed through a systematic multi-engine translation and human validation pipeline. Unlike existing Arabic-English corpora that rely on short sentences or single-domain text, ASCAT targets full scientific abstracts averaging 141.7 words (English) and 111.78 words (Arabic), drawn from five scientific domains: physics, mathematics, computer science, quantum mechanics, and artificial intelligence. Each abstract was translated using three complementary architectures: generative AI (Gemini), transformer-based models (Hugging Face quickmt-en-ar), and commercial MT APIs (Google Translate, DeepL). The translations were subsequently validated by domain experts at the lexical, syntactic, and semantic levels. The resulting corpus…
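The multi-engine translation and validation flow described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names Entry, collect_candidates, and mean_word_count are hypothetical, the engines are plain callables standing in for the real systems, and word counts use whitespace splitting, which may differ from the tokenization behind the reported averages.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# The three validation levels named in the abstract (the tuple itself is illustrative).
LEVELS = ("lexical", "syntactic", "semantic")


@dataclass
class Entry:
    """One English abstract with its candidate translations (hypothetical record layout)."""
    source_en: str
    candidates: Dict[str, str] = field(default_factory=dict)  # engine name -> Arabic output
    checks: Dict[str, bool] = field(default_factory=dict)     # level -> expert verdict

    def is_validated(self) -> bool:
        # An entry counts as validated only if every level was approved.
        return all(self.checks.get(level, False) for level in LEVELS)


def collect_candidates(abstract: str,
                       engines: Dict[str, Callable[[str], str]]) -> Entry:
    """Run every engine on the same abstract and keep all outputs side by side."""
    entry = Entry(source_en=abstract)
    for name, translate in engines.items():
        entry.candidates[name] = translate(abstract)
    return entry


def mean_word_count(texts: List[str]) -> float:
    """Average whitespace-separated word count (an assumed, simplified tokenization)."""
    return sum(len(t.split()) for t in texts) / len(texts)
```

In use, each engine would be wrapped as a function from an English string to an Arabic string and passed in as a dictionary. An expert reviewer would then fill in checks for each level before the entry is admitted to the corpus.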
