TL;DR
This paper introduces an end-to-end reinforcement learning framework for watermarking large language models that balances detectability, text quality, robustness, and security, and outperforms existing methods in resisting spoofing attacks.
Contribution
The paper presents a novel RL-based watermarking method with an anchoring mechanism and regularization to enhance stability and security, addressing challenges of reward hacking and multi-criteria optimization.
Findings
Achieves state-of-the-art trade-offs across multiple watermarking criteria.
Improves resistance to spoofing attacks without sacrificing text quality.
Demonstrates effectiveness on standard benchmarks with two LLMs.
Abstract
Watermarking has emerged as a promising solution for tracing and authenticating text generated by large language models (LLMs). A common approach to LLM watermarking is to construct a green/red token list and assign higher or lower generation probabilities to the corresponding tokens, respectively. However, most existing watermarking algorithms rely on heuristic green/red token list designs, as directly optimizing the list design with techniques such as reinforcement learning (RL) comes with several challenges. First, desirable watermarking involves multiple criteria, i.e., detectability, text quality, robustness against removal attacks, and security against spoofing attacks. Directly optimizing for these criteria introduces many partially conflicting reward terms, leading to an unstable convergence process. Second, the vast action space of green/red token list choices is susceptible to…
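To make the green/red token list idea concrete, here is a minimal sketch of the heuristic baseline the abstract describes, not the paper's RL-based method. It assumes a hash-seeded green list keyed on the previous token, a fixed logit bias for green tokens, and a z-score detector. The names `green_list`, `sample_watermarked`, `z_score`, and `generate`, the constants `VOCAB_SIZE`, `GAMMA`, and `DELTA`, and the random stand-in for model logits are all hypothetical choices made for illustration.

```python
import hashlib
import math
import random

VOCAB_SIZE = 1000  # toy vocabulary size (assumption)
GAMMA = 0.5        # fraction of the vocabulary placed on the green list (assumption)
DELTA = 4.0        # logit bias added to green tokens (assumption)


def green_list(prev_token, key="secret"):
    """Derive a pseudo-random green list from the previous token and a secret key."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(VOCAB_SIZE))
    rng.shuffle(ids)
    return set(ids[: int(GAMMA * VOCAB_SIZE)])


def sample_watermarked(logits, prev_token, rng):
    """Raise green-token logits by DELTA, then sample from the softmax."""
    green = green_list(prev_token)
    biased = [l + DELTA if i in green else l for i, l in enumerate(logits)]
    top = max(biased)  # subtract the max for numerical stability
    weights = [math.exp(b - top) for b in biased]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def z_score(tokens):
    """Count green hits and compare against the GAMMA * n expected by chance."""
    n = len(tokens) - 1
    hits = sum(tokens[i] in green_list(tokens[i - 1]) for i in range(1, len(tokens)))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def generate(length, watermark, seed=0):
    """Generate toy text; random Gaussian logits stand in for a real language model."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(length):
        logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
        if watermark:
            tokens.append(sample_watermarked(logits, tokens[-1], rng))
        else:
            weights = [math.exp(l) for l in logits]
            tokens.append(rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0])
    return tokens


if __name__ == "__main__":
    print("watermarked z-score:  ", round(z_score(generate(200, True)), 2))
    print("unwatermarked z-score:", round(z_score(generate(200, False)), 2))
```

In this sketch, watermarked text should yield a z-score far above what chance allows, while unwatermarked text should stay near zero. The fixed hash-based split is the kind of heuristic list design that the abstract contrasts with a learned one.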