$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks
Baizhou Huang, Xiao Pu, Xiaojun Wan

TL;DR
This paper introduces $B^4$, a novel black-box attack method that effectively removes watermarks from LLM-generated content without prior knowledge of watermark specifics, challenging current watermark robustness assumptions.
Contribution
The paper presents a new black-box scrubbing attack on LLM watermarks formulated as a constrained optimization problem, demonstrating superior performance over existing methods.
Findings
$B^4$ outperforms baseline attacks across 12 settings.
It effectively removes watermarks without prior knowledge of watermark details.
The approach is applicable in realistic black-box scenarios.
Abstract
Watermarking has emerged as a prominent technique for detecting LLM-generated content by embedding imperceptible patterns. Despite its strong performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some approaches even require knowledge of the watermarking method's hyperparameters. Such prerequisites are unattainable in real-world scenarios. Targeting a more realistic black-box threat model with fewer assumptions, we propose $B^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy…
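The constrained formulation described in the abstract can be sketched as follows. This is an illustrative reading only, not the paper's exact objective: the symbols $x$ (watermarked input text), $y$ (attacked output text), $P_F$ (Fidelity Distribution), $P_W$ (Watermark Distribution), the threshold $\tau$, and the multiplier $\lambda$ are all notation assumed here for illustration.

$$
\max_{y} \; \log P_F(y \mid x) \quad \text{subject to} \quad \log P_W(y) \le \tau
$$

One common way to handle such a constraint, again an assumption rather than something the abstract states, is a Lagrangian relaxation with $\lambda \ge 0$:

$$
\max_{y} \; \log P_F(y \mid x) - \lambda \left( \log P_W(y) - \tau \right)
$$

Under this reading, the attack searches for a rewrite that stays faithful to the original content while being unlikely under the watermark's pattern.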
Taxonomy
Topics: Cryptography and Residue Arithmetic · Cryptography and Data Security · Digital and Cyber Forensics
