On the Reliability of Watermarks for Large Language Models
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, Tom Goldstein

TL;DR
This paper evaluates the robustness of watermarks in large language models against human and machine paraphrasing, demonstrating that watermarks remain detectable despite modifications, thus supporting their use for identifying AI-generated text.
Contribution
The study provides a comprehensive analysis of watermark robustness in realistic scenarios, introducing new detection schemes and comparing their effectiveness against various attacks.
Findings
Watermarks are detectable after human and machine paraphrasing.
Detection remains reliable with high confidence after significant text modifications.
Watermark detection is effective even when embedded within large, mixed documents.
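The findings above can be made concrete with a small sketch of how a watermark might be detected inside a longer mixed document. This is an illustration under stated assumptions, not the paper's implementation. It assumes a green-list watermark scored with a one-proportion z-test, where a green fraction `GAMMA` of tokens is expected by chance. The hash-based `is_green` rule, the `min_window` parameter, and all function names are hypothetical. The windowed search is written in the spirit of the WinMax detector mentioned in the reviews, but its details here are assumed.

```python
import hashlib
import math

GAMMA = 0.25  # assumed fraction of the vocabulary marked "green" at each step


def is_green(prev_token, token, gamma=GAMMA):
    """Hypothetical green-list rule: hash the (previous, current) token pair into [0, 1)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def green_flags(tokens, gamma=GAMMA):
    """One boolean per token that has a predecessor: True if the token is green."""
    return [is_green(prev, tok, gamma) for prev, tok in zip(tokens, tokens[1:])]


def z_score(green_count, total, gamma=GAMMA):
    """One-proportion z-statistic for green_count green tokens out of total."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))


def windowed_max_z(flags, min_window=20, gamma=GAMMA):
    """Largest z-score over all contiguous windows of at least min_window tokens.

    Brute-force O(n^2) search using prefix sums; fine for a sketch.
    """
    n = len(flags)
    if n == 0:
        raise ValueError("need at least one scored token")
    min_window = min(min_window, n)
    prefix = [0]
    for flag in flags:
        prefix.append(prefix[-1] + int(flag))
    best = -math.inf
    for start in range(n - min_window + 1):
        for end in range(start + min_window, n + 1):
            green = prefix[end] - prefix[start]
            best = max(best, z_score(green, end - start, gamma))
    return best


# Synthetic example: a 50-token all-green span buried in 300 non-green tokens.
flags = [False] * 150 + [True] * 50 + [False] * 150
whole_document_z = z_score(sum(flags), len(flags))
best_window_z = windowed_max_z(flags)
```

On this synthetic input the whole-document z-score is negative, while the best window scores roughly 12.2, which illustrates why a windowed test can recover a watermarked span that a global test misses. Taking a maximum over many windows also raises the score of unwatermarked text, so any decision threshold would need to be calibrated for that.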
Abstract
As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: How reliable is watermarking in realistic settings in the wild? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text,…
Peer Reviews
Decision · ICLR 2024 poster
* Overall, I quite liked the paper and think that it addresses an interesting problem.
* The appendix is extremely detailed and offers a lot of valuable information on the reproducibility of their study.
* The paper shows that the studied text watermark is robust against many attacks, including human paraphrasing, which is somewhat surprising.
* The paper outlines many useful parameters and graphs for evaluating the robustness of watermarking.
The main issue I have with the paper is an unclear threat model: What is an attacker allowed to do to paraphrase a sequence correctly? When is a paraphrased text too dissimilar from the watermarked text? The paper does not answer these fundamental questions, but follow-up papers must rely on these answers to propose improved attacks. Consider the following example: A human and a paraphraser want to preserve the "meaning" of the watermarked text. The watermark hides with high probability in high-
1. The research scope (whether watermarks embedded in AI-generated text can be easily removed) is an important and timely topic.
2. Empirical results are abundant and show the promise of the reliability of the evaluated watermark methods.
1. Probably due to page limits, most of the space is used for presenting numerical results. The methodology section in Sec. 3, which includes a new watermark method (e.g. SelfHash) and a new detection method (WinMax), is relatively short (roughly one page), and much important information is deferred to the Appendix.
2. The analysis would be more complete if it included more recent and advanced post-hoc detection methods (such as RADAR https://arxiv.org/abs/2307.03838), because DetectGPT is known to be non-r
Comprehensive Evaluation: The paper conducts a thorough and comprehensive evaluation of watermarking, considering various real-world attack scenarios, including paraphrasing, copy-paste, and human rewriting. This multifaceted approach provides valuable insights into the strengths and limitations of watermarking in practical settings.
Comparison to Alternative Methods: The paper not only focuses on watermarking but also compares it to alternative detection approaches, including post-hoc detector
Lack of Theoretical Background: The paper does not delve deeply into the theoretical aspects of watermarking, which could be crucial in understanding the underlying principles and potential vulnerabilities. A more robust theoretical foundation could enhance the paper's overall quality.
Inherent Model Bias: The paper uses a specific language model (llama) for its experiments. While this model is justified and used for practical reasons, the results might not be universally applicable to all lang
Videos
"Watermarking Language Models" paper and GPTZero EXPLAINED | How to detect text by ChatGPT? · youtube
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques
