Detecting Benchmark Contamination Through Watermarking
Tom Sander, Pierre Fernandez, Saeed Mahloujifar, Alain Durmus, Chuan Guo

TL;DR
This paper presents a watermarking technique for benchmarks to detect contamination in LLM training data, ensuring evaluation reliability without compromising benchmark utility.
Contribution
It introduces a watermarking method for benchmarks and a statistical test to detect contamination in models trained on watermarked data.
Findings
Effective detection of benchmark contamination in LLMs.
Watermarking preserves benchmark utility.
Detection sensitivity increases with contamination level.
Abstract
Benchmark contamination poses a significant challenge to the reliability of Large Language Models (LLMs) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark utility. During evaluation, we can detect "radioactivity", i.e., traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection…
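The statistical test mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a green-list style watermark in which a secret key and the previous token pseudo-randomly mark a fraction `GAMMA` of the vocabulary as "green", and it assumes the suspect model's next-token predictions on the benchmark are already available as token IDs. The names `SECRET_KEY`, `GAMMA`, `is_green`, `binomial_p_value`, and `radioactivity_p_value` are hypothetical. Under the null hypothesis (the model never saw the watermarked text), each distinct prediction is green with probability `GAMMA`, so the green count follows a binomial distribution and a one-sided tail gives a p-value.

```python
import hashlib
from math import comb

GAMMA = 0.25  # assumed fraction of the vocabulary marked "green"
SECRET_KEY = "hypothetical-benchmark-key"  # illustrative value, not from the paper


def is_green(prev_token: int, token: int, key: str = SECRET_KEY, gamma: float = GAMMA) -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by the key and the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def binomial_p_value(green: int, total: int, gamma: float = GAMMA) -> float:
    """One-sided tail P(X >= green) for X ~ Binomial(total, gamma)."""
    return sum(
        comb(total, k) * gamma**k * (1 - gamma) ** (total - k)
        for k in range(green, total + 1)
    )


def radioactivity_p_value(context_tokens, predicted_tokens, gamma: float = GAMMA) -> float:
    """Score the suspect model's predictions, counting each (previous, predicted) pair once."""
    seen = set()
    green = 0
    for prev, pred in zip(context_tokens, predicted_tokens):
        if (prev, pred) in seen:
            continue  # skip repeats so that scored pairs stay (approximately) independent
        seen.add((prev, pred))
        green += is_green(prev, pred)
    return binomial_p_value(green, len(seen), gamma)
```

For example, `binomial_p_value(10, 10)` equals `0.25 ** 10`, roughly 9.5e-7, whereas `binomial_p_value(3, 10)` is far larger: the more the model's predictions favor green tokens, the stronger the evidence of contamination. The explicit sum is only practical for small counts; for large token counts a library tail function such as `scipy.stats.binom.sf` would be a more robust choice.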
Peer Reviews
Decision·Submitted to ICLR 2026
* The paper is well-written, easy to follow, and features comprehensive, well-executed experiments.
* The proposed technique is demonstrably effective. To confirm its reliability and understand how the repetition factor influences the strength of the statistical test, the authors trained multiple LLMs from scratch on a pre-contaminated corpus.
* The authors propose a novel detection algorithm to detect radioactivity in case of tokenizer mismatch between the rephrasing LLMs and the suspect LLMs.

* The method overlaps significantly with past works that were not used as baselines or mentioned in related works [1,2,3,4]. For example, [1] also proposes a similar idea and proposes a hypothesis test that leverages watermarked rephrases. [2] proposes a dataset inference method that can also be used for detecting contamination. Works such as these should be compared in related works and evaluated against as benchmarks.
* The current demonstration, while confirming the utility of the rephrase

- Proactive and Verifiable: The core strength is its shift from post-hoc, inferential detection methods to a proactive approach that provides verifiable, statistical proof (a p-value) of contamination. Meanwhile, the experiments show that even with a strong watermark, the rephrased benchmarks remain effective for evaluating and ranking models, with performance being very similar to the original versions.
- Practicality: The paper addresses the practical challenge of distinct tokenizers by introd

- White-Box Access Requirement: The detection test requires full logit access to the suspect model. This limits its use to open-source models and cannot be used by external parties to audit closed, API-only models.
- Vulnerability to Intentional Evasion: The framework is primarily designed to detect unintentional contamination from sources like web scraping. A determined, malicious actor could potentially devise strategies to circumvent detection, such as by rephrasing the questions again to rem
- The paper tackles a relevant and important issue in current research.
- The proposed idea of generating the pseudo-random green list is interesting and effective.
- It presents extensive empirical results that support the authors' claims and analyze the impact of the hyperparameters.
- There is a theoretical guarantee (Proposition 1) for the proposed detection method.

- The proposed method seems to focus on models that are trained and perform inference via next-token prediction, but the paper does not mention this limitation.
- The proposed method is effective at detecting contamination, but it does not prevent it.
- Some details about the proposed method and the experiments are missing.
- Some parts of the writing are not clear to me.
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques
