Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study

Ammar Toutou; Abdelrahman Harb; Christine Basta

arXiv:2605.07453·cs.CL·May 11, 2026

TL;DR

This study reveals significant data contamination issues in neural hieroglyphic translation datasets, demonstrating how contamination inflates performance metrics and providing a decontaminated test set for more accurate evaluation.

Contribution

The paper identifies data contamination in hieroglyphic translation datasets, quantifies its impact on model performance, and releases a decontaminated test set for future research.

Findings

01

Contaminated test samples score up to 83.8 BLEU, versus 30.9 to 39.2 BLEU on clean samples.

02

Decontamination reduces BLEU by approximately 4.6 points.

03

A new, clean test set enables realistic assessment of translation models.

Abstract

Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora, which makes data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find that 32% of test targets appear identically in training (16/50; 50% under 8-gram overlap at a 70% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9 to 39.2 BLEU / 0.622 to 0.676 COMET-22 on clean samples across five model configurations spanning two architectures.…
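The two contamination checks named in the abstract (identical targets, and n-gram overlap above a threshold) can be sketched roughly as below. This is an illustrative sketch, not the authors' released code: the function names, whitespace tokenization, and the reading of "8-gram overlap at 70% threshold" as "at least 70% of a test target's word 8-grams occur somewhere in the training targets" are all assumptions, and the toy sentences are invented.

```python
from typing import Dict, List, Set, Tuple


def word_ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(
    train_targets: List[str],
    test_targets: List[str],
    n: int = 8,              # n-gram size (assumed to be word-level)
    threshold: float = 0.7,  # assumed: minimum share of shared n-grams
) -> Dict[str, List[int]]:
    """Flag test targets that also occur in, or heavily overlap with, training targets."""
    train_exact = {t.strip() for t in train_targets}
    train_grams: Set[Tuple[str, ...]] = set()
    for target in train_targets:
        train_grams |= word_ngrams(target.split(), n)

    exact: List[int] = []
    overlap: List[int] = []
    for idx, target in enumerate(test_targets):
        if target.strip() in train_exact:
            exact.append(idx)
        grams = word_ngrams(target.split(), n)
        # Targets shorter than n tokens have no n-grams, so only the
        # exact-match check can flag them.
        if grams and len(grams & train_grams) / len(grams) >= threshold:
            overlap.append(idx)
    return {"exact": exact, "overlap": overlap}


# Toy data (invented), using bigrams so the short sentences have n-grams.
train = ["der Herr gibt ein Opfer", "er sagt"]
test = [
    "der Herr gibt ein Opfer",  # identical to a training target
    "der Herr gibt ein Brot",   # shares 3 of its 4 bigrams with training
    "ein neuer Satz hier",      # shares no bigrams with training
]
report = contamination_report(train, test, n=2, threshold=0.7)
print(report)
```

Scoring the flagged and unflagged indices separately would then give the contaminated-versus-clean comparison the abstract reports; the exact split used in the paper may differ from this sketch.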
