$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks

Baizhou Huang; Xiao Pu; Xiaojun Wan

arXiv:2411.01222 · cs.CL · November 8, 2024
TL;DR

This paper introduces $B^4$, a novel black-box attack method that effectively removes watermarks from LLM-generated content without prior knowledge of watermark specifics, challenging current watermark robustness assumptions.

Contribution

The paper presents a new black-box scrubbing attack on LLM watermarks formulated as a constrained optimization problem, demonstrating superior performance over existing methods.

Findings

1. $B^4$ outperforms baseline attacks across 12 settings.
2. It effectively removes watermarks without prior knowledge of watermark details.
3. The approach is applicable in realistic black-box scenarios.

Abstract

Watermarking has emerged as a prominent technique for detecting LLM-generated content by embedding imperceptible patterns. Despite its strong performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some work even requires knowledge of the watermarking method's hyperparameters. Such prerequisites are unattainable in real-world scenarios. Targeting a more realistic black-box threat model with fewer assumptions, we propose $B^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy…
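The abstract is cut off before it describes how the optimization is actually solved, so the snippet below is only a toy illustration of the general shape of a constrained selection problem. It is not the $B^4$ algorithm. It assumes that each candidate rewrite already carries two precomputed scores: a log-likelihood under some proxy for the Watermark Distribution, and a log-likelihood under some proxy for the Fidelity Distribution. The `Candidate` class, both score fields, `select_rewrite`, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    # Hypothetical proxy score: log-likelihood under a "watermark" distribution.
    # Lower means the text looks less watermarked.
    watermark_logprob: float
    # Hypothetical proxy score: log-likelihood under a "fidelity" distribution.
    # Higher means the text stays closer to the original meaning.
    fidelity_logprob: float


def select_rewrite(candidates: List[Candidate], min_fidelity: float) -> Optional[Candidate]:
    """Toy constrained selection (illustrative only, not the paper's method).

    Keep only candidates whose fidelity score meets the constraint, then return
    the one least likely under the watermark proxy. Returns None if no
    candidate is feasible.
    """
    feasible = [c for c in candidates if c.fidelity_logprob >= min_fidelity]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.watermark_logprob)


# Made-up scores for three hypothetical rewrites.
candidates = [
    Candidate("original watermarked text", watermark_logprob=-1.0, fidelity_logprob=-0.5),
    Candidate("light paraphrase", watermark_logprob=-3.0, fidelity_logprob=-1.2),
    Candidate("aggressive rewrite", watermark_logprob=-6.0, fidelity_logprob=-4.0),
]

best = select_rewrite(candidates, min_fidelity=-2.0)
print(best.text)  # light paraphrase
```

The fidelity threshold acts as the constraint. Loosening it (for example, `min_fidelity=-5.0`) admits the aggressive rewrite, which scores lowest under the watermark proxy. Tightening it beyond every candidate's fidelity score leaves no feasible choice.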
Videos

$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks · Underline

Taxonomy

Topics: Cryptography and Residue Arithmetic · Cryptography and Data Security · Digital and Cyber Forensics