Dataset Watermarking for Closed LLMs with Provable Detection
Pengrun Huang, Kamalika Chaudhuri, Yu-Xiang Wang

TL;DR
This paper presents a dataset watermarking technique for closed large language models: training on the watermarked data leaves a signature that can be provably detected from the model's generated outputs, with minimal impact on model utility.
Contribution
It introduces the first provable dataset watermarking method for closed LLMs, embedding detectable signatures via co-occurrence frequency manipulation.
Findings
Reliable watermark detection with p < 0.01 in fine-tuning
Effective watermark detection even when watermarked data is 1% of total tokens
Preserves model utility and semantic integrity
Abstract
Large language models (LLMs) are pre-trained and post-trained on vast amounts of loosely curated data, raising the possibility that these models may have been trained on proprietary datasets or the same benchmarks used for evaluation. This motivates the need for dataset watermarking: designing datasets such that training on them leaves detectable signatures in the resulting model. Prior work has explored this problem for open models. We introduce the first dataset watermarking method for closed LLMs with provable detection. In particular, we embed a dataset-level watermark signal by increasing the co-occurrence frequency of randomly selected word pairs through rephrasing, and detect it using a statistical test on co-occurrence patterns in model-generated outputs. We evaluate our method with multiple base models and benchmark datasets and show that it reliably detects the watermark ($p…
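The detection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' test: it assumes a simple paired sign test, in which each secretly chosen word pair is matched with a control pair that was not boosted, and the null hypothesis is that the watermarked pair is no more likely than its control to co-occur more often in model outputs. The function names (`cooccurrence_count`, `sign_test_p_value`), the whitespace tokenization, and the per-text co-occurrence window are all assumptions made for illustration.

```python
from math import comb


def cooccurrence_count(texts, pair):
    """Count the texts in which both words of `pair` appear (assumed window: whole text)."""
    a, b = pair
    count = 0
    for text in texts:
        tokens = set(text.lower().split())  # naive whitespace tokenization (assumption)
        if a in tokens and b in tokens:
            count += 1
    return count


def sign_test_p_value(texts, watermarked_pairs, control_pairs):
    """One-sided sign test: do watermarked pairs co-occur more often than their controls?

    Ties are dropped. Under the null, each remaining comparison is a fair coin
    flip, so the p-value is the upper tail of Binomial(n, 0.5) at the number of wins.
    """
    wins = losses = 0
    for wm_pair, ctrl_pair in zip(watermarked_pairs, control_pairs):
        diff = cooccurrence_count(texts, wm_pair) - cooccurrence_count(texts, ctrl_pair)
        if diff > 0:
            wins += 1
        elif diff < 0:
            losses += 1
    n = wins + losses
    if n == 0:
        return 1.0  # no evidence either way
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Hypothetical toy data: 8 secret pairs, each appearing together in one generated text.
watermarked = [(f"w{i}", f"x{i}") for i in range(8)]
control = [(f"w{i}", f"y{i}") for i in range(8)]
generated = [f"w{i} x{i} filler" for i in range(8)]
p_value = sign_test_p_value(generated, watermarked, control)
```

In this toy setting all eight comparisons favor the watermarked pair, so the p-value is 1/256, below the 0.01 level reported in the findings. A real detector would need many more pairs and outputs, and a test whose validity matches how the pairs were actually sampled.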
