ASCAT: An Arabic Scientific Corpus and Benchmark for Advanced Translation Evaluation

Serry Sibaee; Khloud Al Jallad; Zineb Yousfi; Israa Elsayed Elhosiny; Yousra El-Ghawi; Batool Balah; Omer Nacar

arXiv:2604.00015·cs.CL·April 2, 2026

TL;DR

ASCAT is a high-quality Arabic-English scientific corpus created through multi-engine translation and expert validation, serving as a benchmark for evaluating and training scientific translation models.

Contribution

It introduces a systematic pipeline for constructing a large, validated scientific translation corpus for Arabic-English, covering multiple scientific domains.

Findings

01

Benchmarking three state-of-the-art LLMs on ASCAT shows varied translation quality.

02

The corpus contains over 67,000 English tokens and 60,000 Arabic tokens.

03

ASCAT effectively evaluates scientific translation models for Arabic.

Abstract

We present ASCAT (Arabic Scientific Corpus for Advanced Translation), a high-quality English-Arabic parallel benchmark corpus designed for scientific translation evaluation and constructed through a systematic multi-engine translation and human validation pipeline. Unlike existing Arabic-English corpora that rely on short sentences or single-domain text, ASCAT targets full scientific abstracts averaging 141.7 words (English) and 111.78 words (Arabic), drawn from five scientific domains: physics, mathematics, computer science, quantum mechanics, and artificial intelligence. Each abstract was translated using three complementary architectures: generative AI (Gemini), transformer-based models (Hugging Face quickmt-en-ar), and commercial MT APIs (Google Translate, DeepL). The translations were subsequently validated by domain experts at the lexical, syntactic, and semantic levels. The resulting corpus…
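
As a rough illustration of how per-language averages like the ones quoted in the abstract might be computed, the sketch below counts whitespace-separated words over a list of parallel abstracts. The whitespace tokenization, the helper name mean_word_count, and the two placeholder sentence pairs are assumptions made for this example; the source does not state how the authors counted words, and the sample text is not taken from ASCAT.

```python
from statistics import mean


def mean_word_count(texts):
    """Average number of whitespace-separated words per text.

    Hypothetical helper: whitespace splitting is an assumption,
    not the tokenization reported for ASCAT.
    """
    return mean(len(text.split()) for text in texts)


# Placeholder (English, Arabic) pairs for illustration only; not ASCAT data.
pairs = [
    ("Quantum states evolve unitarily.", "تتطور الحالات الكمومية بشكل أحادي."),
    ("We prove a new bound.", "نثبت حدًا جديدًا."),
]

en_avg = mean_word_count(en for en, _ in pairs)
ar_avg = mean_word_count(ar for _, ar in pairs)

print(f"English: {en_avg:.2f} words per text")
print(f"Arabic:  {ar_avg:.2f} words per text")
```

Applied to real abstract pairs, the same two calls would give one average per language; a different word-counting convention (for example, one that splits Arabic clitics) would likely shift the Arabic figure.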

Code & Models

Datasets

NAMAA-Space/ASCAT-Arabic-Scientific-Translation
dataset · 55 dl