Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study
Ammar Toutou, Abdelrahman Harb, Christine Basta

TL;DR
This study reveals significant data contamination issues in neural hieroglyphic translation datasets, demonstrating how contamination inflates performance metrics and providing a decontaminated test set for more accurate evaluation.
Contribution
The paper identifies data contamination in hieroglyphic translation datasets, quantifies its impact on model performance, and releases a decontaminated test set for future research.
Findings
Contaminated test samples score up to 83.8 BLEU, versus 30.9 to 39.2 BLEU on clean samples.
Decontamination reduces BLEU by approximately 4.6 points.
A new, clean test set enables realistic assessment of translation models.
Abstract
Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora, which makes data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find 32% of test targets appear identically in training (16/50; 50% under 8-gram overlap at 70% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9 to 39.2 BLEU / 0.622 to 0.676 COMET-22 on clean samples across five model configurations spanning two architectures.…
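The two contamination checks named in the abstract (identical targets, and 8-gram overlap at a 70% threshold) can be illustrated with a minimal sketch. This is not the authors' code: the whitespace tokenization, the choice to pool n-grams over all training targets, the handling of sentences shorter than n tokens, and the function names `ngrams` and `contamination_report` are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams. Assumption: a sentence shorter than n
    tokens is treated as a single unit so it can still be matched."""
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(
    train_targets: List[str],
    test_targets: List[str],
    n: int = 8,
    threshold: float = 0.7,
) -> Dict[str, object]:
    """Flag test targets that (a) appear identically in training, or
    (b) share at least `threshold` of their n-grams with training.
    Hypothetical helper; the paper's exact criterion may differ."""
    train_exact = {t.strip() for t in train_targets}
    train_ngrams: Set[Tuple[str, ...]] = set()
    for t in train_targets:
        train_ngrams |= ngrams(t.split(), n)

    exact, overlap = [], []
    for i, t in enumerate(test_targets):
        if t.strip() in train_exact:
            exact.append(i)
        grams = ngrams(t.split(), n)
        if grams and len(grams & train_ngrams) / len(grams) >= threshold:
            overlap.append(i)

    total = len(test_targets)
    return {
        "exact": exact,
        "overlap": overlap,
        "exact_rate": len(exact) / total if total else 0.0,
        "overlap_rate": len(overlap) / total if total else 0.0,
    }


# Toy data, not drawn from the paper's corpus.
train = ["a b c d e f g h i j", "x y"]
test = ["a b c d e f g h i j", "a b c d e f g h i z", "x y", "p q r"]
report = contamination_report(train, test)
```

As a sanity check on the reported figures, 16 identical targets out of 50 gives an exact-match rate of 16/50 = 0.32. Test indices outside the `overlap` list would form a decontaminated subset under these assumptions.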
