Sandcastles in the Storm: Revisiting the (Im)possibility of Strong Watermarking

Fabrice Y Harel-Canada; Boran Erol; Connor Choi; Jason Liu; Gary Jiarui Song; Nanyun Peng; Amit Sahai

arXiv:2505.06827 · cs.CR · May 13, 2025



TL;DR

This study empirically demonstrates that watermarks in AI-generated text are more robust than previously thought, due to slow mixing and imperfect quality detection, challenging existing theoretical assumptions about watermark erasure.

Contribution

The paper provides large-scale experiments showing slow mixing and flawed quality detection, revealing practical robustness of watermarks against theoretical attack models.

Findings

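01 Watermark traces persist after hundreds of edits: 100% of perturbed texts retain traces of their origin.

02 State-of-the-art quality detectors judge edits with only 77% accuracy.

03 Automated attacks remove watermarks only 26% of the time, dropping to 10% under human quality review.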

Abstract

Watermarking AI-generated text is critical for combating misuse. Yet recent theoretical work argues that any watermark can be erased via random walk attacks that perturb text while preserving quality. However, such attacks rely on two key assumptions: (1) rapid mixing (watermarks dissolve quickly under perturbations) and (2) reliable quality preservation (automated quality oracles perfectly guide edits). Through large-scale experiments and human-validated assessments, we find mixing is slow: 100% of perturbed texts retain traces of their origin after hundreds of edits, defying rapid mixing. Oracles falter, as state-of-the-art quality detectors misjudge edits (77% accuracy), compounding errors during attacks. Ultimately, attacks underperform: automated walks remove watermarks just 26% of the time -- dropping to 10% under human quality review. These findings challenge the inevitability of…
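To make the "compounding errors" point concrete, here is a minimal arithmetic sketch. It is not the paper's method. It assumes the 77% figure is a per-judgment accuracy and that oracle judgments are independent across steps; both are simplifications made here, and the function name `prob_all_correct` is hypothetical.

```python
def prob_all_correct(accuracy: float, n_steps: int) -> float:
    """Chance that n_steps oracle judgments are all correct.

    Assumes each judgment is independent with the same accuracy
    (an illustrative assumption, not a claim from the paper).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return accuracy ** n_steps


# Under these assumptions, the chance of an error-free walk shrinks
# geometrically with the number of oracle-guided edits.
for n in (1, 5, 10, 20):
    print(f"{n:>3} steps: {prob_all_correct(0.77, n):.4f}")
```

Under these assumptions, the chance that a walk of even ten edits is guided correctly at every step falls below 8%, which is one way to read why a long random walk steered by an imperfect oracle tends to drift in quality.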


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Steganography and Watermarking Techniques · Advanced Malware Detection Techniques