DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Yize Cheng; Wenxiao Wang; Mazda Moayeri; Soheil Feizi

arXiv:2505.23001 · cs.CL · September 25, 2025

TL;DR

DyePack is a framework that uses backdoor attacks to reliably detect whether large language models were trained on benchmark test sets. It provably limits false accusations and needs no access to model internals such as the loss or logits.

Contribution

It introduces a provable backdoor-based method for detecting test set contamination in LLMs with guaranteed false positive control.
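The false positive guarantee can be illustrated with a short calculation. This is a minimal sketch under assumptions chosen for the illustration, not the paper's exact formulation: each of `num_backdoors` backdoors gets a target drawn independently and uniformly from `num_targets` options, and a model is flagged when at least `threshold` backdoors reproduce their targets. A model that never trained on the test set has no information about the random targets, so it matches each one with probability at most 1/`num_targets`, and the chance of a false flag is bounded by a binomial tail. The function name and parameters are hypothetical.

```python
from fractions import Fraction
from math import comb


def false_positive_bound(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """Upper bound on the probability that an uncontaminated model is flagged.

    Assumes targets are drawn independently and uniformly from `num_targets`
    options, independent of the model, so the number of matched backdoors is
    dominated by Binomial(num_backdoors, 1/num_targets).
    """
    if not 0 <= threshold <= num_backdoors:
        raise ValueError("threshold must be between 0 and num_backdoors")
    p = Fraction(1, num_targets)  # exact arithmetic avoids rounding in tiny tails
    tail = sum(
        comb(num_backdoors, k) * p**k * (1 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )
    return float(tail)
```

For example, with 8 backdoors, 10 possible targets each, and a threshold of 6 matches (illustrative values, not the paper's settings), the bound is 2341/10^8, roughly 2.3e-5. Exact fractions are used because the tail probabilities of interest can be very small.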

Findings

1. Successfully detects all contaminated models across multiple datasets.
2. Achieves extremely low false positive rates in experiments.
3. Generalizes well to open-ended generation tasks.

Abstract

Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five…
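The dye-pack mechanism described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the DyePack repository's code: the trigger strings, the multiple-choice answer options, the `model` callable, and the majority rule for deciding that a backdoor fired are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Backdoor:
    trigger: str  # hypothetical trigger phrase prepended to prompts
    target: str   # answer the backdoor samples are labeled with


def make_dye_pack(
    triggers: Sequence[str], answer_options: Sequence[str], rng: random.Random
) -> List[Backdoor]:
    # Stochastic targets: each backdoor's target is drawn uniformly and
    # independently, so it cannot be guessed without training on the samples.
    return [Backdoor(t, rng.choice(answer_options)) for t in triggers]


def backdoor_samples(
    dye_pack: Sequence[Backdoor], questions: Sequence[str]
) -> List[Dict[str, str]]:
    # Samples to mix into the released test set: every question is paired
    # with each trigger and labeled with that backdoor's target.
    return [
        {"prompt": f"{bd.trigger} {q}", "answer": bd.target}
        for bd in dye_pack
        for q in questions
    ]


def count_activated(
    dye_pack: Sequence[Backdoor],
    questions: Sequence[str],
    model: Callable[[str], str],
) -> int:
    # Black-box check: only the model's text outputs are used, no loss or logits.
    # Assumed rule: a backdoor is "activated" if the model returns its target
    # on more than half of the triggered prompts.
    activated = 0
    for bd in dye_pack:
        hits = sum(model(f"{bd.trigger} {q}") == bd.target for q in questions)
        if hits > len(questions) / 2:
            activated += 1
    return activated
```

A model that memorized the triggered samples reproduces each backdoor's target, while a model that never saw them has no information about the randomly drawn targets. The count of activated backdoors would then be compared against a threshold whose false positive rate follows from the number of backdoors and target options.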

Code & Models

Repositories

chengez/DyePack (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques · Topic Modeling

Methods: Sparse Evolutionary Training