Learning to Watermark LLM-generated Text via Reinforcement Learning

Xiaojun Xu; Yuanshun Yao; Yang Liu

arXiv:2403.10553·cs.LG·March 19, 2024·3 cites

TL;DR

This paper introduces a reinforcement learning framework that embeds detectable watermarks into the LLM's weights (model level) rather than into individual output tokens; the authors report better robustness and adaptability than token-level methods.

Contribution

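It proposes a co-training approach that fine-tunes the LLM to carry the watermark in its weights while training a paired detector to recognize its output, instead of watermarking the output of a fixed LLM. The authors report improved robustness against attacks.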

Findings

01 The learned watermarks are reported to be more accurate and more robust than those of prior token-level methods.

02 The method allows watermarked models to be open-sourced.

03 Watermarking adds little overhead when combined with alignment.

Abstract

We study how to watermark LLM outputs, i.e., how to embed algorithmically detectable signals into LLM-generated text to track misuse. Unlike the current mainstream methods that work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermarks that embed signals into the output, we design a model-level watermark that embeds signals into the LLM weights, and such signals can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to detect the generated watermarked text and (2) tunes the LLM to generate text easily detectable by the detector while keeping its normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks). It also allows watermarked model…
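
The two alternating steps described in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the "LLM" is a single categorical distribution over 8 tokens, the detector is logistic regression on token frequencies, step (2) uses plain REINFORCE, and a KL penalty toward the unwatermarked reference distribution stands in for "keeping normal utility". All names (`co_train`, `tune_policy`, `train_detector`, `kl_coef`, and so on) and all hyperparameters are hypothetical choices made for illustration.

```python
import math
import random

VOCAB, SEQ_LEN, BATCH = 8, 16, 32  # toy sizes, chosen for illustration only

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_text(probs, rng):
    # A "text" is just SEQ_LEN i.i.d. tokens drawn from the toy policy.
    return rng.choices(range(VOCAB), weights=probs, k=SEQ_LEN)

def features(text):
    # Token-frequency vector; a real detector would be a neural model.
    freq = [0.0] * VOCAB
    for tok in text:
        freq[tok] += 1.0 / len(text)
    return freq

def detect(det, text):
    w, b = det
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, features(text))))

def train_detector(det, watermarked, plain, rng, lr=1.0):
    # Step (1): one SGD pass of logistic regression, label 1 = watermarked.
    w, b = det
    data = [(t, 1.0) for t in watermarked] + [(t, 0.0) for t in plain]
    rng.shuffle(data)
    for text, label in data:
        err = detect((w, b), text) - label
        x = features(text)
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err
    return w, b

def tune_policy(logits, ref_logp, det, rng, lr=1.0, kl_coef=0.1):
    # Step (2): REINFORCE with the detector score as reward, plus a KL
    # penalty toward the reference distribution (assumed utility proxy).
    logp = log_softmax(logits)
    probs = [math.exp(lp) for lp in logp]
    texts = [sample_text(probs, rng) for _ in range(BATCH)]
    rewards = [detect(det, t) for t in texts]
    baseline = sum(rewards) / BATCH
    grad = [0.0] * VOCAB
    for text, r in zip(texts, rewards):
        for tok in text:
            for i in range(VOCAB):
                indicator = 1.0 if i == tok else 0.0
                grad[i] += (r - baseline) * (indicator - probs[i]) / BATCH
    kl = sum(p * (lp - lq) for p, lp, lq in zip(probs, logp, ref_logp))
    return [
        z + lr * (g - kl_coef * p * (lp - lq - kl))
        for z, g, p, lp, lq in zip(logits, grad, probs, logp, ref_logp)
    ]

def co_train(rounds=100, seed=0):
    rng = random.Random(seed)
    ref_logp = log_softmax([0.0] * VOCAB)   # unwatermarked reference
    ref_probs = [math.exp(lp) for lp in ref_logp]
    logits = [0.0] * VOCAB                   # policy starts at the reference
    det = ([0.0] * VOCAB, 0.0)
    for _ in range(rounds):
        probs = [math.exp(lp) for lp in log_softmax(logits)]
        watermarked = [sample_text(probs, rng) for _ in range(BATCH)]
        plain = [sample_text(ref_probs, rng) for _ in range(BATCH)]
        det = train_detector(det, watermarked, plain, rng)
        logits = tune_policy(logits, ref_logp, det, rng)
    return logits, ref_probs, det

def mean_score(det, probs, rng, n=200):
    # Average detector score over n freshly sampled texts.
    return sum(detect(det, sample_text(probs, rng)) for _ in range(n)) / n
```

Calling `co_train()` and then comparing `mean_score` on samples from the tuned policy against samples from `ref_probs` shows whether the detector separates the two. The toy has no prompts and no real language model, so it says nothing about text quality or about the prompt dependence raised in the reviews; it only shows the shape of the alternating loop.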

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1. This paper proposes to fine-tune the LLMs to embed watermarks.
2. The proposed method is robust against different attacks.
3. The idea that combines the watermark embedding process with the alignment process is interesting.

Weaknesses

1. Detection needs the original prompt, which is usually unavailable at detection time.
2. This paper uses D^{nw} (human-written prompts and answers) to fine-tune the LLM and the detector. My concern is that the detector may have learned the difference between human-written text and LLM-generated text, rather than the difference between unwatermarked text (generated by unwatermarked LLMs) and watermarked text. It would be good to present the results between the original LLM and the fine-tuned LLM

Reviewer 02 · Rating 5 · Confidence 4

Strengths

• Successfully proposes and implements a watermarking method using fine-tuning and reinforcement learning
• Conducts comprehensive experiments on watermark detectability and robustness
• Successfully integrates the proposed fine-tuning method into existing alignment workflows

Weaknesses

• The method appears to require the prompt that generated the text being tested for watermarks. This prerequisite fundamentally differs from current inference-time watermarking methods. The authors don't explicitly discuss how this condition affects watermark embedding and detection
• The requirement of having the original prompt for detection significantly limits practical detection scenarios
• The detectability and robustness experiments don't explicitly discuss the impact of prompts. For ex

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- This paper introduces watermarking LLMs by fine-tuning, which makes watermark detection easier and more robust to attacks such as paraphrasing.
- This manuscript is well-written and easy to follow.

Weaknesses

- Fine-tuning the generative models and using an additional detector for watermark verification is not new, and related methods [1, 2] are supposed to be discussed in the related work section.
- I am concerned about the reliability of using a language model as the detector instead of statistical tests.
- It is unknown whether the extra fine-tuning process will introduce side effects or biases into LLMs, and there is no theoretical analysis of the changed parameters by additional fine-tuning. Re

Code & Models

Repositories

xiaojunxu/learning-to-watermark-llm
PyTorch · Official

Taxonomy

Topics: Digital Rights Management and Security · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
