On the Reliability of Watermarks for Large Language Models

John Kirchenbauer; Jonas Geiping; Yuxin Wen; Manli Shu; Khalid Saifullah; Kezhi Kong; Kasun Fernando; Aniruddha Saha; Micah Goldblum; Tom Goldstein

arXiv:2306.04634 · cs.LG · May 3, 2024 · 23 cites

TL;DR

This paper evaluates the robustness of watermarks in large language models against human and machine paraphrasing, demonstrating that watermarks remain detectable despite modifications, thus supporting their use for identifying AI-generated text.

Contribution

The study provides a comprehensive analysis of watermark robustness in realistic scenarios, introducing new detection schemes and comparing their effectiveness against various attacks.

Findings

01. Watermarks remain detectable after human and machine paraphrasing.

02. Detection remains reliable, with high confidence, after significant text modifications.

03. Watermark detection is effective even when the watermarked text is embedded within a larger, mixed document.
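
The third finding, detection inside a longer mixed document, can be pictured with a windowed scan. The sketch below is a simplified illustration and not the paper's WinMax detector: it assumes each token has already been labeled green or not by some green-list watermark scheme, assumes a green-list fraction `gamma` of 0.25, and uses a single fixed window size. The names `z_score` and `best_window` and the synthetic document are hypothetical.

```python
import math
from typing import Sequence, Tuple


def z_score(green: int, total: int, gamma: float) -> float:
    """One-proportion z-statistic for `green` hits out of `total` tokens."""
    return (green - gamma * total) / math.sqrt(total * gamma * (1.0 - gamma))


def best_window(green_flags: Sequence[bool], window: int,
                gamma: float = 0.25) -> Tuple[float, int]:
    """Slide a fixed-size window over the document.

    Returns (highest z-score, start index of the first window reaching it).
    """
    n = len(green_flags)
    if window <= 0 or window > n:
        raise ValueError("window must be between 1 and len(green_flags)")
    count = sum(green_flags[:window])  # green tokens in the first window
    best_z, best_start = z_score(count, window, gamma), 0
    for start in range(1, n - window + 1):
        # Update the running count: add the entering token, drop the leaving one.
        count += green_flags[start + window - 1] - green_flags[start - 1]
        z = z_score(count, window, gamma)
        if z > best_z:
            best_z, best_start = z, start
    return best_z, best_start


if __name__ == "__main__":
    # Synthetic data (assumption): unwatermarked text is 25% green,
    # the embedded watermarked span is 75% green.
    human = [i % 4 == 3 for i in range(400)]
    marked = [i % 4 != 0 for i in range(60)]
    doc = human + marked + human
    print("whole-document z:", z_score(sum(doc), len(doc), 0.25))
    print("best window (z, start):", best_window(doc, 60))
```

On this synthetic document the whole-document score is roughly 2.4, while the best 60-token window scores roughly 8.9, which shows why scanning windows can recover a watermarked span that a single global test would dilute.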

Abstract

As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: How reliable is watermarking in realistic settings in the wild? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text,…
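
To see why a diluted watermark can still register, it helps to look at the one-proportion z-test used for green-list watermarks such as the one in the linked lm-watermarking repository. What follows is a minimal sketch, not the authors' implementation: the function name `watermark_z_score`, the green-list fraction `gamma = 0.25`, and the token counts are all illustrative assumptions, not results from the paper.

```python
import math


def watermark_z_score(green_count: int, total: int, gamma: float = 0.25) -> float:
    """z-statistic for observing `green_count` green tokens among `total`.

    Under the null hypothesis (no watermark) each scored token is assumed
    to be green independently with probability `gamma`.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std


if __name__ == "__main__":
    # Hypothetical attacked text in which 40% of tokens remain green.
    # The green fraction is held fixed while the length grows.
    for green, total in [(20, 50), (80, 200), (320, 800)]:
        print(total, "tokens -> z =", round(watermark_z_score(green, total), 2))
```

With the green fraction held at 40%, the z-score grows with the square root of the number of tokens: quadrupling the length doubles the score. An attack that lowers the green fraction without returning it to `gamma` therefore weakens the signal per token but does not remove it.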

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

* Overall, I quite liked the paper and think that it addresses an interesting problem.
* The appendix is extremely detailed and offers a lot of valuable information on the reproducibility of their study.
* The paper shows that the studied text watermark is robust against many attacks, including human paraphrasing, which is somewhat surprising.
* The paper outlines many useful parameters and graphs for evaluating the robustness of watermarking.

Weaknesses

The main issue I have with the paper is an unclear threat model: What is an attacker allowed to do to paraphrase a sequence correctly? When is a paraphrased text too dissimilar from the watermarked text? The paper does not answer these fundamental questions, but follow-up papers must rely on these answers to propose improved attacks. Consider the following example: A human and a paraphraser want to preserve the "meaning" of the watermarked text. The watermark hides with high probability in high-

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The research scope (whether watermarks embedded in AI-generated text can be easily removed or not) is an important and timely topic.
2. Empirical results are abundant and show the promise of the reliability of the evaluated watermark methods.

Weaknesses

1. Probably due to page limits, most of the space is used for presenting numerical results. The methodology section, including a new watermark method (e.g. SelfHash) and a new detection method (WinMax) in Sec. 3, is relatively short (roughly one page), and much important information is deferred to the Appendix.
2. The analysis would be more complete if it included more recent and advanced post-hoc detection methods (such as RADAR https://arxiv.org/abs/2307.03838), because DetectGPT is known to be non-r

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

Comprehensive Evaluation: The paper conducts a thorough and comprehensive evaluation of watermarking, considering various real-world attack scenarios, including paraphrasing, copy-paste, and human rewriting. This multifaceted approach provides valuable insights into the strengths and limitations of watermarking in practical settings.

Comparison to Alternative Methods: The paper not only focuses on watermarking but also compares it to alternative detection approaches, including post-hoc detector

Weaknesses

Lack of Theoretical Background: The paper does not delve deeply into the theoretical aspects of watermarking, which could be crucial in understanding the underlying principles and potential vulnerabilities. A more robust theoretical foundation could enhance the paper's overall quality.

Inherent Model Bias: The paper uses a specific language model (LLaMA) for its experiments. While this model is justified and used for practical reasons, the results might not be universally applicable to all lang

Code & Models

Repositories

jwkirchenbauer/lm-watermarking · PyTorch · Official

Videos

"Watermarking Language Models" paper and GPTZero EXPLAINED | How to detect text by ChatGPT?· youtube

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques