Detecting Benchmark Contamination Through Watermarking

Tom Sander; Pierre Fernandez; Saeed Mahloujifar; Alain Durmus; Chuan Guo

arXiv:2502.17259·cs.CR·July 22, 2025

Open Access · 3 Reviews

TL;DR

This paper presents a watermarking technique for benchmarks to detect contamination in LLM training data, ensuring evaluation reliability without compromising benchmark utility.

Contribution

It introduces a watermarking method for benchmarks and a statistical test to detect contamination in models trained on watermarked data.

Findings

01 · Effective detection of benchmark contamination in LLMs.

02 · Watermarking preserves benchmark utility.

03 · Detection sensitivity increases with contamination level.

Abstract

Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark utility. During evaluation, we can detect "radioactivity", i.e., traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection…
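The statistical test mentioned in the abstract can be illustrated with a small sketch. This summary does not reproduce the paper's exact scoring rule, so everything below is an assumption for illustration: a green-list watermark that marks a fraction `gamma` of the vocabulary as "green" by hashing a secret `key` with the previous token, a score that counts how many of the suspect model's predicted tokens fall in the green list, and a one-sided binomial test against the null hypothesis "the model never saw the watermark". The names `is_green`, `binomial_p_value`, and `radioactivity_p_value` are hypothetical.

```python
import hashlib
from math import exp, lgamma, log


def is_green(prev_token: int, token: int, key: int, gamma: float = 0.5) -> bool:
    """Assumed green-list rule: hash (key, previous token, token) to a number
    in [0, 1) and call the token green if it falls below gamma."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def binomial_p_value(k: int, n: int, gamma: float) -> float:
    """P(X >= k) for X ~ Binomial(n, gamma), summed in log space so that
    large n does not overflow. Requires 0 < gamma < 1."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(gamma) + (n - i) * log(1.0 - gamma)
        )
        total += exp(log_term)
    return min(1.0, total)


def radioactivity_p_value(prev_tokens, predicted_tokens, key, gamma=0.5):
    """Count green predictions and return (green, scored, p_value).

    prev_tokens[i] is the benchmark token preceding position i, and
    predicted_tokens[i] is the suspect model's top prediction there.
    Repeated (prev, predicted) pairs are scored once; this deduplication
    is an assumption made here to keep the scored tokens closer to
    independent under the null hypothesis.
    """
    seen = set()
    green = scored = 0
    for prev, pred in zip(prev_tokens, predicted_tokens):
        if (prev, pred) in seen:
            continue
        seen.add((prev, pred))
        scored += 1
        green += is_green(prev, pred, key, gamma)
    return green, scored, binomial_p_value(green, scored, gamma)
```

Under these assumptions, a model that was not trained on the watermarked benchmark should predict green tokens at a rate close to `gamma`, giving a p-value that is roughly uniform, while a contaminated model is expected to over-predict green tokens and yield a small p-value. Obtaining `predicted_tokens` requires access to the suspect model's next-token predictions on the benchmark text.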

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* The paper is well-written, easy to follow, and features comprehensive, well-executed experiments.
* The proposed technique is demonstrably effective. To confirm its reliability and understand how the repetition factor influences the strength of the statistical test, the authors trained multiple LLMs from scratch on a pre-contaminated corpus.
* The authors propose a novel detection algorithm to detect radioactivity in case of tokenizer mismatch between the rephrasing LLMs and the suspect LLMs.

Weaknesses

* The method overlaps significantly with past works that were not used as baselines or mentioned in related works [1,2,3,4]. For example, [1] also proposes a similar idea and proposes a hypothesis test that leverages watermarked rephrases. [2] proposes a dataset inference method that can also be used for detecting contamination. Works such as these should be compared in related works and evaluated against as benchmarks.
* The current demonstration, while confirming the utility of the rephrase

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Proactive and Verifiable: The core strength is its shift from post-hoc, inferential detection methods to a proactive approach that provides verifiable, statistical proof (a p-value) of contamination. Meanwhile, the experiments show that even with a strong watermark, the rephrased benchmarks remain effective for evaluating and ranking models, with performance being very similar to the original versions.
- Practicality: The paper addresses the practical challenge of distinct tokenizers by introd

Weaknesses

- White-Box Access Requirement: The detection test requires full logit access to the suspect model. This limits its use to open-source models, so external parties cannot use it to audit closed, API-only models.
- Vulnerability to Intentional Evasion: The framework is primarily designed to detect unintentional contamination from sources like web scraping. A determined, malicious actor could potentially devise strategies to circumvent detection, such as by rephrasing the questions again to rem

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper tackles a relevant and important issue in current research.
- The proposed idea of generating the pseudo-random green list is interesting and effective.
- It presents extensive empirical results that support the authors' claims and analyze the impact of the hyperparameters.
- There is a theoretical guarantee (Proposition 1) for the proposed detection method.

Weaknesses

- The proposed method seems to focus on models that are trained and perform inference via next-token prediction, but the paper does not mention this limitation.
- The proposed method is effective at detecting contamination, but it does not prevent it.
- Some details about the proposed method and the experiments are missing.
- Some parts of the writing are not clear to me.

Taxonomy

Topics: Advanced Steganography and Watermarking Techniques