Can Watermarked LLMs be Identified by Users via Crafted Prompts?

Aiwei Liu; Sheng Guan; Yiming Liu; Leyi Pan; Yifei Zhang; Liancheng Fang; Lijie Wen; Philip S. Yu; Xuming Hu

arXiv:2410.03168·cs.CR·January 29, 2025

TL;DR

This paper demonstrates that watermarked LLMs can be identified by users through crafted prompts, revealing a vulnerability in current watermarking techniques, and proposes strategies to improve watermark imperceptibility.

Contribution

It introduces Water-Probe, an algorithm to detect watermarks via prompts, and proposes Water-Bag to enhance watermark imperceptibility by increasing key randomness.

Findings

01

Watermarked LLMs are easily identified with well-designed prompts.

02

Water-Probe achieves low false positive rates.

03

Water-Bag strategy improves watermark imperceptibility.
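
The intuition behind increasing key randomness can be illustrated with a toy model. The sketch below is not the paper's Water-Bag construction: it assumes a uniform base model over ten tokens, a hypothetical green-list watermark that boosts half the vocabulary by `DELTA`, and a bag that simply draws one key uniformly at random per generation. All names (`green_list`, `bag_dist`, `tv_from_uniform`) are illustrative. Under these assumptions, the averaged output distribution of a larger bag sits closer to the unwatermarked one.

```python
import math
import random

VOCAB = list(range(10))   # toy vocabulary of ten tokens (assumption)
DELTA = 2.0               # hypothetical logit boost for "green" tokens

def green_list(key):
    """Pick half of the vocabulary as green tokens, determined only by the key."""
    rng = random.Random(key)
    return set(rng.sample(VOCAB, len(VOCAB) // 2))

def watermarked_dist(key):
    """Token distribution of a uniform base model after boosting green logits."""
    green = green_list(key)
    weights = [math.exp(DELTA) if t in green else 1.0 for t in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def bag_dist(keys):
    """Average distribution when each generation draws a key uniformly from the bag."""
    dists = [watermarked_dist(k) for k in keys]
    return [sum(col) / len(keys) for col in zip(*dists)]

def tv_from_uniform(dist):
    """Total variation distance between dist and the uniform base distribution."""
    return 0.5 * sum(abs(p - 1.0 / len(VOCAB)) for p in dist)

tv_single = tv_from_uniform(bag_dist([0]))             # one fixed key
tv_bag = tv_from_uniform(bag_dist(list(range(16))))    # bag of 16 keys
print(f"single key: {tv_single:.4f}  bag of 16 keys: {tv_bag:.4f}")
```

The single-key distance is fixed by `DELTA` alone, while the bag's distance depends on how much the individual green lists overlap.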

Abstract

Text watermarking for Large Language Models (LLMs) has made significant progress in detecting LLM outputs and preventing misuse. Current watermarking techniques offer high detectability, minimal impact on text quality, and robustness to text editing. However, current research lacks investigation into the imperceptibility of watermarking techniques in LLM services. This is crucial as LLM providers may not want to disclose the presence of watermarks in real-world scenarios, as it could reduce user willingness to use the service and make watermarks more vulnerable to attacks. This work is the first to investigate the imperceptibility of watermarked LLMs. We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts to the LLM. Our key motivation is that current watermarked LLMs expose consistent biases under the same watermark key,…
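
The stated motivation, that a fixed watermark key leaves a consistent bias across repeated sampling, can be made concrete with a small simulation. This is a toy illustration only, not the Water-Probe algorithm: it assumes a uniform base model over ten tokens (which a real black-box user would not know exactly), a hypothetical context-free green-list watermark with boost `DELTA`, and two sampling rounds standing in for two crafted prompts. The names `sample_counts` and `shared_bias` are made up for this sketch.

```python
import math
import random

VOCAB = list(range(10))   # toy vocabulary of ten tokens (assumption)
DELTA = 2.0               # hypothetical logit boost for "green" tokens

def green_list(key):
    """Pick half of the vocabulary as green tokens, determined only by the key."""
    rng = random.Random(key)
    return set(rng.sample(VOCAB, len(VOCAB) // 2))

def sample_counts(n, rng, key=None):
    """Draw n tokens from a uniform base model, watermarked if a key is given."""
    green = green_list(key) if key is not None else set()
    weights = [math.exp(DELTA) if t in green else 1.0 for t in VOCAB]
    counts = [0] * len(VOCAB)
    for tok in rng.choices(VOCAB, weights=weights, k=n):
        counts[tok] += 1
    return counts

def deviation(counts):
    """Per-token gap between empirical frequency and the uniform base rate."""
    n = sum(counts)
    return [c / n - 1.0 / len(VOCAB) for c in counts]

def shared_bias(counts_a, counts_b):
    """Dot product of two deviation vectors; large only if both lean the same way."""
    return sum(x * y for x, y in zip(deviation(counts_a), deviation(counts_b)))

rng = random.Random(0)
# Two sampling rounds against the same (hypothetical) watermark key.
wm = shared_bias(sample_counts(2000, rng, key=42), sample_counts(2000, rng, key=42))
# Two sampling rounds against an unwatermarked model.
plain = shared_bias(sample_counts(2000, rng), sample_counts(2000, rng))
print(f"watermarked: {wm:.4f}  unwatermarked: {plain:.4f}")
```

With the same key, both rounds over-sample the same green tokens, so their deviations align; without a watermark, the deviations are independent sampling noise and their product stays near zero.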

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. This work is the first study on the imperceptibility of watermarked LLMs.
2. The paper is well organized and well written, making it easy to follow.
3. The experiments are conducted across different LLMs and with different sampling methods and temperature settings. The conclusions and discussion based on the evaluation results are clear.

Weaknesses

The threat model should be further described, especially in terms of the prior knowledge assumptions of the detector.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. It is reasonable to use repeated sampling to detect whether an LLM has been watermarked.
2. Prompts have been designed to reveal the connection between generated text and watermark keys in a black-box setting.
3. The evaluation is comprehensive, and many different watermarking methods have been tested. Experimental results also show that the proposed method works.

Weaknesses

1. Symbol reuse. In Section 3 and Section 4, the symbol $P$ represents both the model distribution in $P_M$ and the prompts in $P_1, P_2, ..., P_N$, which can be confusing. $P$ appears too many times between line 222 and line 227.
2. Line 289, Figure 2. ‘KTH’ is not referred to anywhere else in the paper. Is it Exp-Edit or ITS-Edit?

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- **Accuracy and Robustness**: The experimental results demonstrate the high accuracy and robustness of the Water-Probe algorithm across various LLMs, watermarking methods, and generation settings. The low false positive rate for non-watermarked LLMs further strengthens the algorithm’s reliability.
- **Practical Solution**: The proposed Water-Bag strategy offers a practical solution to improve the imperceptibility of watermarks, which is a critical concern for LLM providers.

Weaknesses

- **Concept Confusion**: The manuscript conflates the concepts of watermarking and fingerprinting; the proposed method should be categorized as fingerprinting rather than watermarking. See Question 1.
- **Lack of Contribution**: The authors input prompts to observe the responses of watermarked and non-watermarked LLMs, which is the so-called identification algorithm. The manuscript just defines some concepts and samples prompts to compare the similarity of the inspected models.
- **Limited Scope of Water-Probe**: The

Code & Models

Repositories

thu-bpm/watermarked_llm_identification
PyTorch · Official

Taxonomy

Topics: Optimization and Search Problems · Auction Theory and Applications