On the Learnability of Watermarks for Language Models
Chenchen Gu, Xiang Lisa Li, Percy Liang, Tatsunori Hashimoto

TL;DR
This paper explores whether language models can learn to generate watermarked text directly, using a method called watermark distillation, and examines the effectiveness and limitations of this approach.
Contribution
It introduces watermark distillation, a novel method for training models to generate watermarked text, and evaluates its effectiveness across different strategies and settings.
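The two distillation settings mentioned in the reviews below (logits-based and sample-based) can be illustrated with a minimal sketch. It assumes the logits-based variant minimizes a KL divergence between the watermarked teacher's next-token distribution and the student's, and that the sample-based variant is ordinary likelihood training on watermarked samples. The function names are hypothetical and are not taken from the authors' code.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def logit_distillation_loss(watermarked_teacher_probs, student_logits):
    """Assumed logits-based objective for one position: the student matches
    the teacher's next-token distribution after watermarked decoding."""
    return kl_divergence(watermarked_teacher_probs, softmax(student_logits))


def sampling_distillation_loss(watermarked_token, student_logits):
    """Assumed sample-based objective for one position: negative
    log-likelihood of a token sampled from the watermarked teacher."""
    return -math.log(softmax(student_logits)[watermarked_token])
```

In this sketch the logits-based loss needs access to the teacher's distribution, while the sample-based loss only needs watermarked text, which is the setting a spoofing adversary would be in.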
Findings
Models can learn to generate watermarked text with high detectability.
Learned watermarking capability degrades after further fine-tuning on normal text.
Low-distortion watermarks require high sample complexity to learn.
Abstract
Watermarking of language model outputs enables statistical detection of model-generated text, which can mitigate harms and misuses of language models. Existing watermarking strategies operate by altering the decoder of an existing language model. In this paper, we ask whether language models can directly learn to generate watermarked text, which would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, enabling watermarking for open models, where users can control the decoding procedure. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark…
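To make "altering the decoder" and "statistical detection" concrete, here is a toy sketch of one decoding-based scheme in the style of the green-list watermark that the `kgw` model names below appear to refer to. It assumes `gamma` is the green-list fraction and `delta` is the logit bias; the seeding scheme, the `key` value, and all function names are hypothetical choices for illustration, not the paper's implementation.

```python
import math
import random


def green_list(prev_token, vocab_size, gamma, key=42):
    """Pseudo-random 'green' subset of the vocabulary, seeded by the previous
    token and a secret key (hypothetical seeding scheme)."""
    rng = random.Random(key * 1_000_003 + prev_token)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def sample_watermarked(logits_fn, length, vocab_size, gamma, delta, seed=0):
    """Sample tokens while adding `delta` to the logits of green tokens.
    With delta=0 this reduces to ordinary sampling."""
    rng = random.Random(seed)
    tokens = [0]  # arbitrary start token
    for _ in range(length):
        logits = list(logits_fn(tokens))
        for t in green_list(tokens[-1], vocab_size, gamma):
            logits[t] += delta
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        tokens.append(rng.choices(range(vocab_size), weights=weights)[0])
    return tokens


def detect_z_score(tokens, vocab_size, gamma):
    """z-score of the green-token count against the null hypothesis that
    each token is green with probability gamma."""
    hits = sum(
        cur in green_list(prev, vocab_size, gamma)
        for prev, cur in zip(tokens, tokens[1:])
    )
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


# Toy "language model" with uniform logits over a 100-token vocabulary.
uniform = lambda tokens: [0.0] * 100
z_marked = detect_z_score(sample_watermarked(uniform, 200, 100, 0.25, 2.0), 100, 0.25)
z_plain = detect_z_score(sample_watermarked(uniform, 200, 100, 0.25, 0.0), 100, 0.25)
```

A large `z_marked` relative to `z_plain` is what a detector would threshold on. Because the bias lives entirely in the sampling loop, anyone who controls decoding can drop it, which is the gap for open models that the abstract points to.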
Peer Reviews
Decision: ICLR 2024 poster
It is interesting to introduce forgery/spoofing attacks into the recently popular area of LLMs.
1. This paper directly extends distillation strategies on decoding-based watermarking and then poses the limitation of weight-based watermarking. However, no solution is provided. It seems like an experimental report, and it would be better to introduce the specific solution. 2. The motivation of this paper is there is a limitation of decoding-based watermarking, namely, replacing it with a normal decoder. Is the assumption practical? In practice, for an LLM API, how do we conduct such an oper
* Overall, I liked the paper. The problem of watermarking open models and spoofing attacks against watermarks is timely and important. * The authors show that three watermarking methods are vulnerable to spoofing through distillation. * The authors convince me their distillation approach works for many different generation parameters against all three watermarking methods when sufficiently many samples are available to the attacker. * The authors show experiments with many state-of-the-art
**Contribution to Open Model Watermarking**. As the authors show, open-model watermarking using distillation is not robust against fine-tuning. I am unclear about the contribution of the authors to open model watermarking. No prior work has used distillation to watermark LMs, hence there is no security threat. What is the use of a watermark that (i) lacks robustness and (ii) can be spoofed by design? I would love to hear the authors' thoughts on this. **Limited novelty.** One would expect that
Originality: - Investigates the novel problem of whether language models can learn to generate watermarks themselves. This question has important implications but was previously unexplored. Quality: - Provides extensive empirical results across three watermarking schemes and hyperparameters—thorough, reproducible experiments. - Uses appropriate metrics to assess watermark detection and text quality—rigorous quantitative evaluation. - Clear methodology and training details—enable replicability.
The experimental methodology could be strengthened by 1. controlling for model architecture (use Llama-2 as the student model for both logits-based and sample-based experiments for a more intuitive comparison of the results). This would better isolate the distillation techniques themselves. 2. increasing the dataset diversity (currently, only one dataset is used). 3. evaluating seq-rep-3 for sample-based distillation, which is currently omitted. The technical writing is condensed in some areas, m
Code & Models
- 🤗 cygu/llama-2-7b-logit-watermark-distill-kgw-k1-gamma0.25-delta2 · model · 140 dl · ♡ 1
- 🤗 cygu/llama-2-7b-logit-watermark-distill-kgw-k1-gamma0.25-delta1 · model · 1 dl · ♡ 1
- 🤗 cygu/llama-2-7b-logit-watermark-distill-aar-k2 · model · 6 dl
- 🤗 cygu/llama-2-7b-logit-watermark-distill-aar-k3 · model · 3 dl
- 🤗 cygu/llama-2-7b-logit-watermark-distill-aar-k4 · model · 4 dl
- 🤗 cygu/llama-2-7b-logit-watermark-distill-kth-shift1 · model · 4 dl
- 🤗 cygu/llama-2-7b-logit-watermark-distill-kth-shift2 · model · 5 dl
- 🤗 cygu/llama-2-7b-logit-watermark-distill-kth-shift4 · model · 5 dl
- 🤗 cygu/llama-2-7b-logit-watermark-distill-kth-shift256 · model · 1 dl
- 🤗 cygu/llama-2-7b-sampling-watermark-distill-kgw-k1-gamma0.25-delta2 · model · 2 dl
- cygu/sampling-distill-train-data-kgw-k1-gamma0.25-delta2 · dataset · 7 dl
- cygu/sampling-distill-train-data-kgw-k0-gamma0.25-delta1 · dataset · 16 dl
- cygu/sampling-distill-train-data-kgw-k0-gamma0.25-delta2 · dataset · 4 dl
- cygu/sampling-distill-train-data-kgw-k2-gamma0.25-delta1 · dataset · 9 dl
- cygu/sampling-distill-train-data-kgw-k2-gamma0.25-delta2 · dataset · 12 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning
