RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

Hanbo Huang; Xuan Gong; Yiran Zhang; Hao Zheng; Shiyu Liang

arXiv:2604.11546 · cs.CR · April 14, 2026

TL;DR

This paper introduces RLSpoofer, a reinforcement learning-based black-box spoofing attack for evaluating the robustness of LLM watermarking. It exposes vulnerabilities in watermarking schemes using minimal data and no access to watermarking internals.

Contribution

The paper proposes a lightweight black-box spoofing attack and a distributional perspective for evaluating watermark resilience.

Findings

01 RLSpoofer achieves a 62.0% spoof success rate with minimal supervision.

02 Current watermarking schemes are fragile against black-box spoofing.

03 The framework requires only 100 human-watermarked paraphrase training pairs for effective evaluation.

Abstract

Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a local capacity bottleneck, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak…
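To give a feel for what "probability mass that can be reallocated under KL-bounded local updates" can mean, here is a minimal sketch. It is not the paper's local capacity bottleneck result, whose statement is not given in this summary. It only illustrates a standard fact: if a base distribution puts mass p on some token set A, then any updated distribution within KL divergence eps of the base can put at most q on A, where q is the largest value satisfying the binary KL constraint kl(q || p) <= eps. By Pinsker's inequality, q - p is at most sqrt(eps / 2). The function names `binary_kl` and `max_mass_under_kl`, and the choice of A as a generic target token set, are assumptions made for illustration.

```python
import math


def binary_kl(q: float, p: float) -> float:
    """KL(Bernoulli(q) || Bernoulli(p)) in nats, for 0 < p < 1 and 0 <= q <= 1."""
    def term(a: float, b: float) -> float:
        # Convention: 0 * log(0 / b) = 0.
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)


def max_mass_under_kl(p: float, eps: float, iters: int = 100) -> float:
    """Largest q >= p with binary_kl(q, p) <= eps, found by bisection.

    Hypothetical helper: p is the base mass on a target token set A,
    eps is the KL budget of the update (in nats).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if eps < 0.0:
        raise ValueError("eps must be non-negative")
    lo, hi = p, 1.0
    # binary_kl(q, p) is increasing in q on [p, 1], so bisection applies.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binary_kl(mid, p) <= eps:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    p, eps = 0.5, 0.02
    q = max_mass_under_kl(p, eps)
    print(f"base mass {p}, KL budget {eps}: at most {q:.4f} on A")
    print(f"Pinsker upper bound on the shift: {math.sqrt(eps / 2):.4f}")
```

The sketch shows why a small KL budget caps how far a single update can push mass toward any fixed token set: the achievable shift grows only on the order of the square root of the budget.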
