Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora

Chaymaa Abbas; Nour Shamaa; Mariette Awad

arXiv:2601.14994·cs.CL·January 22, 2026

TL;DR

This paper investigates how data contamination affects multilingual language models, especially through translation, and proposes a translation-aware detection method to improve contamination identification across languages.

Contribution

It introduces a translation-aware contamination detection approach that effectively uncovers data contamination in multilingual benchmarks, addressing limitations of English-only methods.

Findings

1. Translation suppresses traditional contamination signals.
2. Models benefit from contaminated data even in Arabic.
3. Translation-aware detection reliably exposes contamination.

Abstract

Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these approaches are largely limited to English benchmarks, leaving multilingual contamination poorly understood. In this work, we investigate contamination dynamics in multilingual settings by fine-tuning several open-weight LLMs on varying proportions of Arabic datasets and evaluating them on original English benchmarks. To detect memorization, we extend the Tested Slot Guessing method with a choice-reordering strategy and incorporate Min-K% probability analysis, capturing both behavioral and distributional contamination signals. Our results show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from…
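The Min-K% probability signal mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard formulation of Min-K% (the average log-probability of the k% least likely tokens in a text) and that per-token log-probabilities have already been obtained from the model. The function name `min_k_percent_score`, the default `k`, the rounding rule, and the toy values are all hypothetical; the paper's exact configuration is not given in this summary.

```python
import math
from typing import Sequence


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of tokens with the lowest
    log-probability. Higher (closer to 0) scores suggest the text is more
    likely to have been seen during training.

    Assumption: at least one token is always selected, and the count is
    rounded up; other implementations may round differently.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    n_lowest = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_lowest]
    return sum(lowest) / n_lowest


# Toy, made-up log-probabilities (not results from the paper).
seen_like = [-0.1, -0.2, -0.1, -0.3, -0.2]    # no surprising tokens
unseen_like = [-0.1, -4.0, -0.2, -6.0, -0.3]  # a few very unlikely tokens

seen_score = min_k_percent_score(seen_like, k=0.4)
unseen_score = min_k_percent_score(unseen_like, k=0.4)
```

In this toy case `seen_score` is higher than `unseen_score`, which is the direction a detector would read as "more likely memorized". The abstract's point is that translating the contaminated data into Arabic can suppress this kind of signal on the English benchmark even when the model still benefits from the exposure.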

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 0 · Confidence 5

Strengths

The authors select an important problem to work on in a timely manner.

Weaknesses

The experimental design is weak in every respect. Is the objective to measure the presence of cross-lingual contamination on top of English contamination? What is the reason for adding English test examples to the fine-tuning mixture? Why not just check the impact of Arabic contamination? In the current setting, even p=0 is fully contaminated with English test samples. Why use only TS-Guessing? There are many methods that use model outputs to detect training data membership or contamin

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. Strong, Replicable Methodology: The paper's experimental design is its greatest strength. The controlled study fine-tuning on 0-100% translated data is clean, and the results are unambiguous.
2. Effective Use of Probing: The adaptation of TS-Guessing with choice-reordering is clever and provides compelling, direct evidence (the high IDR) that memorization is the cause of the performance gains.
3. Valuable Data Point: The specific focus on Arabic provides a useful case study, extending the

Weaknesses

1. **Missing Critical Prior Work:** The most significant weakness is the failure to compare with existing work [1], which is not just "related"; it established the very phenomenon of cross-lingual contamination that this paper investigates. This omission makes the paper's framing as a novel investigation of a "blind spot" feel misleading.
2. **Limited Scope:** The study is confined to a single language (Arabic) and three benchmarks. The prior CrossLanguage work was far broader, testing seven lan

Reviewer 03 · Rating 0 · Confidence 4

Strengths

1. The scope of the problem is important. As more research revolves around large language models, contamination could inflate models' performance and lead researchers to incorrect conclusions when they build on those results.

Weaknesses

1. Lack of Novelty. The same idea (translation can cause hard-to-find data contamination) has been explored in existing research [1] [2].
2. The method the authors propose, TACD, wasn't even evaluated in their experiments (if I read the paper correctly, it's only a proposal).
3. Writing and plotting could use some improvement. Some of the item lists could be dropped, and the images are overly large.

[1]: Yao, Feng, et al. "Data contamination can cross language barriers." arXiv preprint arXiv:2406.13236 (2024)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling