Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora
Chaymaa Abbas, Nour Shamaa, Mariette Awad

TL;DR
This paper investigates how data contamination affects multilingual language models, especially through translation, and proposes a translation-aware detection method to improve contamination identification across languages.
Contribution
It introduces a translation-aware contamination detection approach that effectively uncovers data contamination in multilingual benchmarks, addressing limitations of English-only methods.
Findings
Translation suppresses traditional contamination signals.
Models benefit from contaminated data even in Arabic.
Translation-aware detection reliably exposes contamination.
Abstract
Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these approaches are largely limited to English benchmarks, leaving multilingual contamination poorly understood. In this work, we investigate contamination dynamics in multilingual settings by fine-tuning several open-weight LLMs on varying proportions of Arabic datasets and evaluating them on original English benchmarks. To detect memorization, we extend the Tested Slot Guessing method with a choice-reordering strategy and incorporate Min-K% probability analysis, capturing both behavioral and distributional contamination signals. Our results show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from…
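The two detection signals named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes per-token log-probabilities have already been obtained from a model, and the function names `min_k_percent_score` and `reorder_choices`, the default `k`, and the seeded shuffle are all hypothetical choices made for illustration. The first function averages the log-probabilities of the least likely tokens, the general idea behind Min-K% probability analysis. The second shuffles multiple-choice options while tracking where the correct answer lands, which is one plausible form of a choice-reordering step.

```python
import math
import random
from typing import List, Sequence, Tuple


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the fraction k of least likely tokens.

    A higher (less negative) score means even the hardest tokens were
    predicted confidently, which is treated as a hint of memorization.
    The default k and any decision threshold are assumptions.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the n smallest log-probabilities
    return sum(lowest) / n


def reorder_choices(
    choices: Sequence[str], answer_index: int, seed: int = 0
) -> Tuple[List[str], int]:
    """Shuffle answer options and return them with the new answer position."""
    rng = random.Random(seed)  # seeded so the permutation is reproducible
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(answer_index)
```

A model that reproduces an option from its original position after reordering, or that scores unusually high on the Min-K% measure for benchmark text, would be a candidate for closer inspection. Neither signal alone proves contamination.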
Peer Reviews
Decision·Submitted to ICLR 2026
The authors select an important problem to work on in a timely manner.
The experimental design is quite weak in all respects. Is the objective to measure the presence of cross-lingual contamination on top of English contamination? What is the reason for adding English test examples to the fine-tuning mixture? Why not just check the impact on Arabic contamination? In the current setting, even p=0 is fully contaminated with English test samples. Why use only TS-Guessing? There are many methods that use model outputs to detect training data membership or contamin
1. Strong, Replicable Methodology: The paper's experimental design is its greatest strength. The controlled study fine-tuning on 0-100% translated data is clean, and the results are unambiguous. 2. Effective Use of Probing: The adaptation of TS-Guessing with choice-reordering is clever and provides compelling, direct evidence (the high IDR) that memorization is the cause of the performance gains. 3. Valuable Data Point: The specific focus on Arabic provides a useful case study, extending the
1. **Missing Critical Prior Work:** The most significant weakness is the failure to compare with existing work [1], which is not just "related"; it established the very phenomenon of cross-lingual contamination that this paper investigates. This omission makes the paper's framing as a novel investigation of a "blind spot" feel misleading. 2. **Limited Scope:** The study is confined to a single language (Arabic) and three benchmarks. The prior CrossLanguage work was far broader, testing seven lan
1. The scope of the problem is important. As more research is done on large language models, contamination could inflate models' performance and lead researchers to incorrect conclusions when they build their research on it.
1. Lack of Novelty. The same idea (translation can cause hard-to-find data contamination) has been explored in existing research [1] [2]. 2. The authors' proposed method, TACD, wasn't even evaluated in their experiments (if I read the paper correctly, it's only a proposal). 3. Writing and plotting could use some improvement. You can drop some of the item lists, and the images are overly large. [1]: Yao, Feng, et al. "Data contamination can cross language barriers." arXiv preprint arXiv:2406.13236 (2024)
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling
