Revisiting the Robustness of Watermarking to Paraphrasing Attacks
Saksham Rastogi, Danish Pruthi

TL;DR
This paper examines how robust text watermarking is to paraphrasing attacks. It shows that access to only a limited number of generations from a black-box watermarked model is enough to make such attacks far more effective at evading detection, which calls the reliability of these schemes into question.
Contribution
The study demonstrates that existing watermarking schemes can be easily circumvented with limited black-box access, highlighting vulnerabilities and the need for more robust methods.
Findings
Paraphrasing attacks informed by only a limited number of watermarked generations can effectively evade watermark detection.
Current watermarking methods are vulnerable to reverse-engineering.
Robustness claims of some watermarking techniques are overstated.
Abstract
Amidst rising concerns about the internet being proliferated with content generated from language models (LMs), watermarking is seen as a principled way to certify whether text was generated from a model. Many recent watermarking techniques slightly modify the output probabilities of LMs to embed a signal in the generated output that can later be detected. Since early proposals for text watermarking, questions about their robustness to paraphrasing have been prominently discussed. Lately, some techniques are deliberately designed and claimed to be robust to paraphrasing. However, such watermarking schemes do not adequately account for the ease with which they can be reverse-engineered. We show that with access to only a limited number of generations from a black-box watermarked model, we can drastically increase the effectiveness of paraphrasing attacks to evade watermark detection,…
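The abstract describes watermarks that slightly modify an LM's output probabilities so that a signal can later be detected. As a rough illustration of that idea only, the sketch below uses a toy green-list logit bias with a z-score detector. This scheme is chosen here for illustration; the abstract does not say which schemes the paper evaluates. The uniform stand-in "model" and the names and constants (`VOCAB_SIZE`, `GAMMA`, `DELTA`, `is_green`, `z_score`) are all hypothetical.

```python
import hashlib
import math
import random

VOCAB_SIZE = 1000  # hypothetical toy vocabulary size
GAMMA = 0.5        # assumed fraction of the vocabulary treated as "green"
DELTA = 4.0        # assumed logit bias added to green tokens


def is_green(prev_token: int, token: int) -> bool:
    """Pseudo-randomly mark `token` as green, keyed on the previous token."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def sample_token(prev_token: int, rng: random.Random, watermark: bool) -> int:
    """Sample from a stand-in uniform 'model', optionally biased toward green tokens."""
    if watermark:
        logits = [DELTA if is_green(prev_token, t) else 0.0 for t in range(VOCAB_SIZE)]
    else:
        logits = [0.0] * VOCAB_SIZE
    weights = [math.exp(x) for x in logits]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate(length: int, seed: int, watermark: bool) -> list[int]:
    """Generate `length` tokens after a fixed start token (0)."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(length):
        tokens.append(sample_token(tokens[-1], rng, watermark))
    return tokens


def z_score(tokens: list[int]) -> float:
    """Standardized count of green tokens; large values suggest a watermark."""
    n = len(tokens) - 1
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    print("watermarked z:", round(z_score(generate(200, seed=1, watermark=True)), 2))
    print("plain z:      ", round(z_score(generate(200, seed=1, watermark=False)), 2))
```

In this toy setup, watermarked text yields a z-score far above what unbiased text produces. A paraphrase that replaces enough green tokens lowers the score toward the unwatermarked range, which is the kind of evasion the abstract is concerned with.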
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques · Internet Traffic Analysis and Secure E-voting · Digital Media Forensic Detection
