A Watermark for Black-Box Language Models
Dara Bahri, John Wieting

TL;DR
This paper introduces a black-box watermarking scheme for large language models that requires only the ability to sample sequences from the model. The scheme is distortion-free, can be chained or nested using multiple secret keys, and outperforms some white-box methods in experiments.
Contribution
A new watermarking method for LLMs that operates with black-box access, providing performance guarantees and flexibility for chaining and nested applications.
Findings
Effective black-box watermark detection demonstrated
Outperforms some white-box schemes in experiments
Supports chaining and nested watermarking
Abstract
Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
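To make the sampling-only setting concrete, here is a minimal toy sketch in Python. It is not the authors' algorithm: it assumes a simple best-of-m selection rule in which a secret key maps each n-gram to a pseudo-random number and the candidate with the highest average is returned. The names `keyed_uniform`, `score`, `watermarked_sample`, and `make_toy_sampler` are hypothetical, the toy sampler merely stands in for an LLM API call, and the sketch makes no attempt to reproduce the paper's distortion-free property or its performance guarantees.

```python
import hashlib
import hmac
import random
from typing import Callable, List, Sequence


def keyed_uniform(key: bytes, ngram: Sequence[int]) -> float:
    """Map an n-gram of token ids to a pseudo-random value in [0, 1) via HMAC-SHA256."""
    msg = ",".join(str(t) for t in ngram).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    # Keep the top 53 bits so the result is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def score(key: bytes, tokens: Sequence[int], n: int = 2) -> float:
    """Average keyed value over the distinct n-grams of a token sequence."""
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if not ngrams:
        return 0.5  # too short to score; fall back to the no-watermark mean
    return sum(keyed_uniform(key, g) for g in ngrams) / len(ngrams)


def watermarked_sample(
    sample_fn: Callable[[], List[int]], key: bytes, m: int = 8
) -> List[int]:
    """Draw m candidate sequences (black-box access only) and keep the top scorer."""
    candidates = [sample_fn() for _ in range(m)]
    return max(candidates, key=lambda c: score(key, c))


def make_toy_sampler(
    seed: int, vocab_size: int = 50, length: int = 20
) -> Callable[[], List[int]]:
    """Hypothetical stand-in for an LLM: returns uniformly random token ids."""
    rng = random.Random(seed)

    def sample() -> List[int]:
        return [rng.randrange(vocab_size) for _ in range(length)]

    return sample
```

Under these assumptions, a detector holding the key would recompute `score` on a given text and flag it when the value lies well above 0.5, the expected score of text generated without the key. How far above is a threshold choice that this sketch leaves open.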
Peer Reviews
Decision · Submitted to ICLR 2025
- While it is based on a generalization of ideas from existing schemes, the exact scheme proposed is to the best of my knowledge novel. The authors do a good job of exploring different variants of the scheme (e.g., CDF) in a principled way.
- The theoretical results are sound. I especially appreciate that Theorem 4.2 is carefully placed into context and analyzed for various input values to demonstrate its implications.
- Experiments are very thorough, involve important aspects such as quality
As a meta point, the authors are using the 2024 style file and should update it to the latest version to avoid desk rejection. I understand that this is an honest mistake, but in particular the lack of usual line numbers is making it hard to refer to particular parts of the writeup. The weaknesses of the paper are in my view: (1) Limitations of the evaluation setup - The authors recognize that AUC is not the most practically relevant metric yet resolve this by proposing a new metric (AUC below
The paper seems to do a good job of optimizing both their scheme, and the schemes they compare against. In particular, it is interesting that making the watermark detector of Aaronson length-aware improves performance as much as it does.
The ideas and method are straightforward adaptations of existing work. The technique is essentially identical to Aaronson's, except that they use rejection sampling instead of the Gumbel-max trick. The scheme is also only distortion-free under certain assumptions about the text, which essentially translate to it having consistently high entropy.
The method is effective in a black-box setting: it requires only the ability to sample sequences from the LLM. The paper provides formal guarantees for detection performance.
The paper’s motivation could be articulated more clearly. The main motivation stems from the security risks associated with providing API access that exposes logits to third-party users for applying their own watermark. However, simpler methods could enhance security; for instance, instead of exposing logits, LLMs could offer APIs to gather specific information users want to integrate. Furthermore, the paper presents a zero-bit watermarking technique, which only detects whether a text is waterma
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques · Cryptography and Data Security
