Black-Box Detection of Language Model Watermarks

Thibaud Gloaguen; Nikola Jovanović; Robin Staab; Martin Vechev

arXiv:2405.20777·cs.CR·February 25, 2025

Open Access · 3 Reviews

TL;DR

This paper develops statistical tests to detect and analyze the presence of watermarking in language models in black-box settings, revealing that current watermarking schemes are more detectable than previously thought.

Contribution

It introduces the first rigorous black-box detection methods for multiple watermarking schemes, demonstrating their effectiveness on various models and real-world APIs.

Findings

01

Watermarking schemes are more detectable than previously believed.

02

Effective statistical tests can identify watermark presence with limited queries.

03

Detection methods work on open-source models and real-world APIs.

Abstract

Watermarking has emerged as a promising way to detect LLM-generated text, by augmenting LLM generations with later detectable signals. Recent work has proposed multiple families of watermarking schemes, several of which focus on preserving the LLM distribution. This distribution-preservation property is motivated by the fact that it is a tractable proxy for retaining LLM capabilities, as well as the inherently implied undetectability of the watermark by downstream users. Yet, despite much discourse around undetectability, no prior work has investigated the practical detectability of any of the current watermarking schemes in a realistic black-box setting. In this work we tackle this for the first time, developing rigorous statistical tests to detect the presence, and estimate parameters, of all three popular watermarking scheme families, using only a limited number of black-box queries.…
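One of the reviews below refers to a detection algorithm for Fixed-Sampling watermarks that is based on unique outputs. The following is a minimal toy sketch of that general idea, not the authors' test: it assumes a watermark whose sampling randomness comes from a small pool of fixed keys, so that repeated queries with the same high-entropy prompt can yield only a bounded number of distinct outputs. The samplers, the pool size `num_keys`, and the helper `unique_fraction` are all hypothetical stand-ins for real black-box API calls.

```python
import random


def unique_fraction(outputs):
    """Fraction of distinct strings among the sampled outputs."""
    return len(set(outputs)) / len(outputs)


def sample_unwatermarked(rng, length=20):
    # Toy stand-in for an unwatermarked model answering a high-entropy
    # prompt: every call draws fresh randomness.
    return "".join(rng.choice("01") for _ in range(length))


def sample_fixed_key(rng, num_keys=16, length=20):
    # Toy stand-in (assumption) for a fixed-sampling watermark: the output
    # is fully determined by one of `num_keys` fixed keys.
    key = rng.randrange(num_keys)
    key_rng = random.Random(key)
    return "".join(key_rng.choice("01") for _ in range(length))


rng = random.Random(0)
n_queries = 200  # number of black-box queries with the same prompt

plain = [sample_unwatermarked(rng) for _ in range(n_queries)]
marked = [sample_fixed_key(rng) for _ in range(n_queries)]

# The keyed sampler can produce at most `num_keys` distinct outputs,
# while the unkeyed sampler almost never repeats itself.
print("unwatermarked unique fraction:", unique_fraction(plain))
print("fixed-key unique fraction:   ", unique_fraction(marked))
```

In this toy setting, a low unique fraction across many queries is the signal of interest. A real test would also need a calibrated threshold and a prompt whose unwatermarked output distribution is known to be high-entropy.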

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

A huge number of watermarking papers have come out recently. Many of them ask whether their watermarks harm generation quality by performing experimental evaluations, but these are inherently limited: There is no way to experimentally guarantee that the watermark will preserve the quality under *every possible* use-case of the model. Therefore, perhaps a more useful test of quality is to simply attempt to detect it. If attacks that are specifically designed to detect the watermark still fail to

Weaknesses

It is not surprising that they were able to easily detect the schemes they attacked. Those schemes are not designed to be undetectable. In the "Limitations" section, they justify the choice to only consider these schemes with the claim that the provably-undetectable schemes "lack experimental validation" and "are not yet practical due to slow generation speed." However, I believe these claims require justification because:
- "Excuse me, sir? Your language model is leaking (information)" is a pr

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper is the first to examine the detectability of current watermarking schemes in a black-box setting, which matches real detection scenarios.
2. The method is clearly written, makes sense, and is easy to understand. Each method has a clear section structure.
3. The experimental results in the black-box scenario verify the effectiveness of the method.

Weaknesses

1. Although the authors pointed out that their motivation is to study the ability of current watermarks to resist detection, they did not highlight the significance of watermark detection in real scenarios. Providing specific application scenarios of black-box watermark detection can help readers better understand the contribution of black-box watermark detection.
2. The results in Table 1 indicate the method in the paper is constrained by the need for distinct detection techniques for various

Reviewer 03 · Rating 6 · Confidence 5

Strengths

This paper suggests that current watermarking schemes may be susceptible to detection in the black-box setting and verifies this experimentally.

Weaknesses

- This paper lacks a clear mathematical presentation of its algorithms, and the descriptions are often vague.
- The detection tasks for Fixed-Sampling and Cache-Augmented watermarks are trivial, and the proposed simple algorithm can be easily defended against.
1. The detection algorithm based on unique outputs is not practical. In real-world applications, one can simply skip the first few tokens to ensure that generated outputs are different, which has been proposed in Algorithm 3 in Christ e

Taxonomy

Topics: Handwritten Text Recognition Techniques · Digital and Cyber Forensics · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Focus