On the Learnability of Watermarks for Language Models

Chenchen Gu; Xiang Lisa Li; Percy Liang; Tatsunori Hashimoto

arXiv:2312.04469 · cs.LG · May 3, 2024 · 2 cites

TL;DR

This paper explores whether language models can learn to generate watermarked text directly, using a method called watermark distillation, and examines the effectiveness and limitations of this approach.

Contribution

It introduces watermark distillation, a novel method for training models to generate watermarked text, and evaluates its effectiveness across different strategies and settings.
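As a rough illustration of what a logit-based form of watermark distillation could look like (the reviews below mention logits-based and sample-based variants), the sketch below computes a KL-divergence loss between a teacher's watermarked next-token distribution and a student's distribution. The green-list bias, the `delta` value, and all function names are assumptions made for this example; the page does not state the paper's exact objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def watermarked_teacher_probs(teacher_logits, green_ids, delta=2.0):
    """Hypothetical decoding-time watermark: add `delta` to the logits of
    'green' tokens, then normalize. This stands in for whichever
    decoder-altering watermark the teacher uses."""
    biased = [x + delta if i in green_ids else x
              for i, x in enumerate(teacher_logits)]
    return softmax(biased)

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, green_ids, delta=2.0):
    """Loss the student would minimize at one position (assumed form):
    match the watermarked teacher distribution with ordinary decoding."""
    target = watermarked_teacher_probs(teacher_logits, green_ids, delta)
    student = softmax(student_logits)
    return kl_divergence(target, student)

if __name__ == "__main__":
    teacher = [0.5, -1.0, 2.0, 0.0]
    green = {1, 3}
    # A student that has not absorbed the watermark incurs a positive loss.
    print(distillation_loss(teacher, teacher, green))
```

In a sample-based variant, the student would instead be fine-tuned with a standard language-modeling loss on text sampled from the watermarked teacher, which requires only generated text rather than teacher logits.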

Findings

1. Models can learn to generate watermarked text with high detectability.
2. Learned watermarks weaken when the model is further fine-tuned on normal (unwatermarked) text.
3. Low-distortion watermarks require high sample complexity to learn.

Abstract

Watermarking of language model outputs enables statistical detection of model-generated text, which can mitigate harms and misuses of language models. Existing watermarking strategies operate by altering the decoder of an existing language model. In this paper, we ask whether language models can directly learn to generate watermarked text, which would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, enabling watermarking for open models, where users can control the decoding procedure. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark…
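To make "altering the decoder" and "statistical detection" concrete, here is a minimal sketch of one well-known family of decoding-based watermarks: a keyed hash of the previous token selects a "green" subset of the vocabulary, green logits receive a bias, and a detector counts green tokens and computes a z-score. It is a toy illustration, not the paper's setup. The vocabulary size, `GAMMA`, `delta`, the key, and the random-logit stand-in for a language model are all assumptions.

```python
import hashlib
import math
import random

VOCAB_SIZE = 50   # toy vocabulary (assumption)
GAMMA = 0.5       # fraction of the vocabulary marked green (assumption)

def green_list(prev_token, key="demo-key"):
    """Derive the green token set from a keyed hash of the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

def sample_token(logits, rng):
    """Sample a token id from softmax(logits)."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(length, delta, rng):
    """Generate tokens; delta > 0 applies the watermark, delta = 0 does not.
    Random Gaussian logits stand in for a real language model."""
    tokens = [0]
    for _ in range(length):
        logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
        greens = green_list(tokens[-1])
        logits = [x + delta if i in greens else x
                  for i, x in enumerate(logits)]
        tokens.append(sample_token(logits, rng))
    return tokens

def z_score(tokens):
    """Detection statistic: how far the green-token count deviates from the
    GAMMA * n expected for unwatermarked text, in standard deviations."""
    n = len(tokens) - 1
    hits = sum(tokens[i] in green_list(tokens[i - 1])
               for i in range(1, len(tokens)))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

if __name__ == "__main__":
    rng = random.Random(0)
    print("watermarked z:", z_score(generate(200, 4.0, rng)))
    print("plain z:      ", z_score(generate(200, 0.0, rng)))
```

Because the bias is applied at decoding time, anyone who controls the decoder of an open model can simply skip it, which is the limitation that motivates asking whether the watermark can be learned into the weights instead.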

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

It is interesting to introduce forgery/spoofing attacks into the recently popular area of LLMs.

Weaknesses

1. This paper directly extends distillation strategies on decoding-based watermarking and then poses the limitation of weight-based watermarking. However, no solution is provided. It seems like an experimental report, and it would be better to introduce the specific solution.
2. The motivation of this paper is there is a limitation of decoding-based watermarking, namely, replacing it with a normal decoder. Is the assumption practical? In practice, for an LLM API, how do we conduct such an oper…

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

* Overall, I liked the paper. The problem of watermarking open models and spoofing attacks against watermarks is timely and important.
* The authors show that three watermarking methods are vulnerable to spoofing through distillation.
* The authors convince me their distillation approach works for many different generation parameters against all three watermarking methods when sufficiently many samples are available to the attacker.
* The authors show experiments with many state-of-the-art…

Weaknesses

**Contribution to Open Model Watermarking.** As the authors show, open-model watermarking using distillation is not robust against fine-tuning. I am unclear about the contribution of the authors to open model watermarking. No prior work has used distillation to watermark LMs, hence there is no security threat. What is the use of a watermark that (i) lacks robustness and (ii) can be spoofed by design? I would love to hear the authors' thoughts on this.

**Limited novelty.** One would expect that…

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

Originality:
- Investigates the novel problem of whether language models can learn to generate watermarks themselves. This question has important implications but was previously unexplored.

Quality:
- Provides extensive empirical results across three watermarking schemes and hyperparameters: thorough, reproducible experiments.
- Uses appropriate metrics to assess watermark detection and text quality: rigorous quantitative evaluation.
- Clear methodology and training details enable replicability.

Weaknesses

The experimental methodology could be strengthened by:
1. Controlling for model architecture (using Llama-2 as the student model for both the logits-based and sample-based experiments, for a more intuitive comparison of the results). This would better isolate the distillation techniques themselves.
2. Increasing the dataset diversity (currently, only one dataset is used).
3. Evaluating seq-rep-3 on the sample-based distillation, where it is currently not reported.

The technical writing is condensed in some areas, m…

Code & Models

Repositories

chenchenygu/watermark-learnability
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning