Learning to Watermark LLM-generated Text via Reinforcement Learning
Xiaojun Xu, Yuanshun Yao, Yang Liu

TL;DR
This paper introduces a novel reinforcement learning-based framework for embedding detectable watermarks into LLMs at the model level, enhancing robustness and flexibility over traditional token-level methods.
Contribution
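It proposes a co-training approach that embeds the watermark into the LLM's weights instead of adding it to the output tokens of a fixed model. A paired detector, trained alongside the LLM, identifies the watermarked text, and the approach improves robustness against attacks.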
Findings
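The learned watermarks are reported to be more accurate, more robust, and more adaptable to new attacks.
Because the watermark lives in the weights, the watermarked model can be open-sourced.
The method adds little overhead when combined with alignment techniques.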
Abstract
We study how to watermark LLM outputs, i.e., embedding algorithmically detectable signals into LLM-generated text to track misuse. Unlike the current mainstream methods that work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermarks that embed signals into the output, we design a model-level watermark that embeds signals into the LLM weights, and such signals can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to detect the generated watermarked text and (2) tunes the LLM to generate text easily detectable by the detector while keeping its normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks). It also allows watermarked model…
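The two alternating steps in the abstract can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and not the authors' implementation: the "LLM" is a single Bernoulli logit over a two-token vocabulary, the detector is a logistic regression on the fraction of 1-tokens, the RL step is plain REINFORCE with a mean baseline, and a KL penalty toward the original model stands in for "keeping its normal utility". The names `co_train`, `sample_text`, `detector_score`, and `mean_score` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_text(logit, length, rng):
    """Toy 'LLM' (assumption): emits token 1 with probability sigmoid(logit)."""
    p = sigmoid(logit)
    return [1 if rng.random() < p else 0 for _ in range(length)]


def detector_score(w, b, text):
    """Toy detector (assumption): logistic regression on the fraction of 1-tokens."""
    return sigmoid(w * sum(text) / len(text) + b)


def co_train(rounds=300, batch=32, length=20, kl_coef=0.05, seed=0):
    rng = random.Random(seed)
    ref_logit = 0.0   # frozen original model; also supplies non-watermarked text
    logit = 0.0       # model being tuned to carry the watermark
    w, b = 0.0, 0.0   # detector parameters
    lr_det, lr_llm = 2.0, 0.5
    for _ in range(rounds):
        wm = [sample_text(logit, length, rng) for _ in range(batch)]
        nw = [sample_text(ref_logit, length, rng) for _ in range(batch)]

        # Step 1: one gradient-ascent step on the detector's log-likelihood
        # (label 1 = watermarked, label 0 = non-watermarked).
        gw = gb = 0.0
        for text, label in [(t, 1.0) for t in wm] + [(t, 0.0) for t in nw]:
            err = label - detector_score(w, b, text)
            gw += err * sum(text) / len(text)
            gb += err
        w += lr_det * gw / (2 * batch)
        b += lr_det * gb / (2 * batch)

        # Step 2: REINFORCE update of the model, with reward = detector score.
        p = sigmoid(logit)
        rewards = [detector_score(w, b, t) for t in wm]
        baseline = sum(rewards) / batch
        # For i.i.d. Bernoulli tokens, d/dlogit log P(text) = sum(text) - length * p.
        pg = sum((r - baseline) * (sum(t) - length * p)
                 for t, r in zip(wm, rewards)) / batch
        # Gradient of the per-token KL(tuned || original) with respect to logit.
        kl_grad = p * (1.0 - p) * (logit - ref_logit)
        logit += lr_llm * (pg - kl_coef * kl_grad)
    return logit, ref_logit, w, b


def mean_score(w, b, logit, n=500, length=20, seed=1):
    """Average detector score on fresh samples from a model with the given logit."""
    rng = random.Random(seed)
    return sum(detector_score(w, b, sample_text(logit, length, rng))
               for _ in range(n)) / n


if __name__ == "__main__":
    logit, ref_logit, w, b = co_train()
    print("mean detector score, tuned model:   ", mean_score(w, b, logit))
    print("mean detector score, original model:", mean_score(w, b, ref_logit))
```

In this toy setting the watermark is simply a shift in token frequency that the detector and the model agree on during training. A real system would replace the Bernoulli model with an LLM policy and the logistic regression with a learned text classifier.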
Peer Reviews
Decision: Submitted to ICLR 2025
1. This paper proposes fine-tuning LLMs to embed watermarks.
2. The proposed method is robust against different attacks.
3. The idea of combining the watermark embedding process with the alignment process is interesting.
1. Detection needs the original prompt, which is usually unavailable at detection time.
2. This paper uses D^{nw} (human-written prompts and answers) to fine-tune the LLM and the detector. My concern is that the detector learned the difference between human-written text and LLM-generated text, rather than the difference between un-watermarked text (generated by un-watermarked LLMs) and watermarked text. It would be good to present results comparing the original LLM and the fine-tuned LLM.
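One way the control requested in point 2 could be run is to compute detection AUC twice: watermarked text against human-written text, and watermarked text against text from the original, un-fine-tuned LLM. A large gap between the two numbers would suggest the detector partly keys on human-versus-machine differences. The sketch below is only an illustration: the helper `auc` is hypothetical, and the score lists are invented placeholders, not results from the paper.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Placeholder detector scores, invented for illustration only.
watermarked = [0.91, 0.85, 0.77, 0.95]    # text from the fine-tuned LLM
human = [0.10, 0.22, 0.05, 0.30]          # human-written answers
original_llm = [0.40, 0.80, 0.35, 0.88]   # text from the un-fine-tuned LLM

auc_vs_human = auc(watermarked, human)
auc_vs_original = auc(watermarked, original_llm)
print("AUC vs human-written text:", auc_vs_human)
print("AUC vs original LLM text: ", auc_vs_original)
```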
• Successfully proposes and implements a watermarking method using fine-tuning and reinforcement learning
• Conducts comprehensive experiments on watermark detectability and robustness
• Successfully integrates the proposed fine-tuning method into existing alignment workflows
• The method appears to require the prompt that generated the text being tested for watermarks. This prerequisite fundamentally differs from current inference-time watermarking methods. The authors don't explicitly discuss how this condition affects watermark embedding and detection
• The requirement of having the original prompt for detection significantly limits practical detection scenarios
• The detectability and robustness experiments don't explicitly discuss the impact of prompts. For ex
- This paper introduces watermarking LLMs by fine-tuning, which makes watermark detection easier and more robust to attacks such as paraphrasing.
- This manuscript is well-written and easy to follow.
- Fine-tuning the generative models and using an additional detector for watermark verification is not new, and related methods [1, 2] should be discussed in the related work section.
- I am concerned about the reliability of using a language model as the detector instead of statistical tests.
- It is unknown whether the extra fine-tuning process will introduce side effects or biases into LLMs, and there is no theoretical analysis of the parameters changed by the additional fine-tuning. Re
Taxonomy
Topics: Digital Rights Management and Security · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
