RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience
Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang

TL;DR
This paper introduces RLSpoofer, a reinforcement learning-based black-box attack that evaluates the robustness of LLM watermarking, revealing its vulnerabilities with minimal data and no internal access.
Contribution
It proposes a lightweight, black-box spoofing attack method and provides a distributional perspective to evaluate watermark resilience.
Findings
RLSpoofer achieves a 62.0% spoof success rate with minimal supervision.
Current watermarking schemes are fragile against black-box spoofing.
The framework requires only 100 human-watermarked paraphrase training pairs, with no access to the watermarking internals or detectors.
Abstract
Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a local capacity bottleneck, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak…
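To give some intuition for why a KL bound limits how much probability mass can be reallocated, the sketch below uses Pinsker's inequality, a standard result stating that for any event A, |Q(A) - P(A)| <= sqrt(KL(Q || P) / 2). This is an assumption-laden illustration and not the paper's local capacity bottleneck, whose exact statement is not given in the abstract. The 4-token vocabulary, the distributions p and q, and the "target" token set are all hypothetical.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) in nats for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def pinsker_mass_bound(kl):
    """Upper bound on |Q(A) - P(A)| for any event A (Pinsker's inequality)."""
    return math.sqrt(kl / 2.0)

# Hypothetical next-token distributions over a 4-token vocabulary.
p = [0.25, 0.25, 0.25, 0.25]  # base distribution before the local update
q = [0.30, 0.30, 0.20, 0.20]  # distribution after a small local update
target = [0, 1]               # hypothetical set of tokens the update favors

kl = kl_divergence(q, p)
# Probability mass actually moved onto the target set by the update.
shift = sum(q[i] for i in target) - sum(p[i] for i in target)
# Largest shift that any update with this KL budget could achieve.
bound = pinsker_mass_bound(kl)

print(f"KL(q || p) = {kl:.4f} nats")
print(f"mass shifted onto target set = {shift:.4f}")
print(f"Pinsker upper bound = {bound:.4f}")
```

In this toy case the shift sits just under the bound, which shows that a small KL budget only allows a small reallocation of mass at any single step.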
