A Reinforcement Learning Framework for Robust and Secure LLM Watermarking

Li An; Yujian Liu; Yepeng Liu; Yuheng Bu; Yang Zhang; Shiyu Chang

arXiv:2510.21053 · cs.CR · October 27, 2025

TL;DR

This paper introduces an end-to-end reinforcement learning framework for watermarking large language models, balancing detectability, quality, robustness, and security, and outperforming existing methods in resisting spoofing attacks.

Contribution

The paper presents a novel RL-based watermarking method with an anchoring mechanism and regularization to enhance stability and security, addressing challenges of reward hacking and multi-criteria optimization.

Findings

01 Achieves state-of-the-art trade-offs across multiple watermarking criteria.

02 Improves resistance to spoofing attacks without sacrificing text quality.

03 Demonstrates effectiveness on standard benchmarks with two LLMs.

Abstract

Watermarking has emerged as a promising solution for tracing and authenticating text generated by large language models (LLMs). A common approach to LLM watermarking is to construct a green/red token list and assign higher or lower generation probabilities to the corresponding tokens, respectively. However, most existing watermarking algorithms rely on heuristic green/red token list designs, as directly optimizing the list design with techniques such as reinforcement learning (RL) comes with several challenges. First, desirable watermarking involves multiple criteria, i.e., detectability, text quality, robustness against removal attacks, and security against spoofing attacks. Directly optimizing for these criteria introduces many partially conflicting reward terms, leading to an unstable convergence process. Second, the vast action space of green/red token list choices is susceptible to…
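For readers unfamiliar with the green/red token list approach mentioned in the abstract, the following is a minimal sketch of the heuristic baseline, not the RL framework this paper proposes. It assumes a hash-seeded partition keyed on the previous token, a fixed logit bonus for green tokens, and a one-proportion z-test for detection. The constants `VOCAB_SIZE`, `GAMMA`, `DELTA`, and `SECRET_KEY`, the function names, and the random-logit toy "model" are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import hashlib
import math
import random

VOCAB_SIZE = 1000        # hypothetical vocabulary size
GAMMA = 0.5              # hypothetical fraction of the vocabulary marked green
DELTA = 4.0              # hypothetical logit bonus added to green tokens
SECRET_KEY = "demo-key"  # hypothetical watermark key


def green_list(prev_token: int) -> set[int]:
    """Pseudo-randomly pick the green tokens, seeded by the key and previous token."""
    digest = hashlib.sha256(f"{SECRET_KEY}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def sample_next(logits: list[float], prev_token: int,
                rng: random.Random, watermark: bool) -> int:
    """Sample from softmax(logits), optionally raising green-token logits by DELTA."""
    if watermark:
        green = green_list(prev_token)
        logits = [l + DELTA if i in green else l for i, l in enumerate(logits)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate(length: int, seed: int, watermark: bool) -> list[int]:
    """Generate tokens from a toy model with random logits (a stand-in for an LLM)."""
    rng = random.Random(seed)
    tokens = [0]  # arbitrary start token
    for _ in range(length):
        logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
        tokens.append(sample_next(logits, tokens[-1], rng, watermark))
    return tokens


def z_score(tokens: list[int]) -> float:
    """One-proportion z-test on how many tokens fall in their green list."""
    n = len(tokens) - 1
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    print("watermarked z:", z_score(generate(100, seed=1, watermark=True)))
    print("plain z:      ", z_score(generate(100, seed=1, watermark=False)))
```

Under these assumptions, watermarked text yields a large positive z-score while unwatermarked text stays near zero. The partition here is fixed by a hash rather than learned, which is the kind of heuristic design the abstract contrasts with directly optimizing the list.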
