Discovering Spoofing Attempts on Language Model Watermarks
Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev

TL;DR
This paper introduces a statistical detection method to identify spoofed watermarks in LLM-generated text, revealing that current spoofing techniques leave detectable artifacts and are less effective than previously believed.
Contribution
The paper is the first to propose a reliable post-hoc statistical test for detecting watermark spoofing in language models, and it highlights the limitations of existing spoofing methods.
Findings
High detection power across all tested spoofing methods
Current spoofing methods leave detectable artifacts
Spoofing attacks are less effective than previously thought
Abstract
LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these…
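The truncated abstract above does not spell out the statistical test, so the following is only a generic illustration of what a post-hoc test for spoofing artifacts could look like, not the authors' method. It assumes, hypothetically, that each token in a text has a binary watermark indicator (`green`) and some auxiliary per-token score (`feature`) that should be unrelated to the indicator in genuinely watermarked text. Both names and the choice of a permutation test on Pearson correlation are assumptions made for this sketch.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def permutation_p_value(green, feature, n_perm=2000, seed=0):
    """Two-sided permutation test of independence between two sequences.

    green   -- hypothetical per-token watermark indicators (0 or 1)
    feature -- hypothetical per-token auxiliary scores
    A small p-value suggests the two are dependent, which in this sketch
    is treated as a possible sign of forgery.
    """
    rng = random.Random(seed)
    observed = abs(pearson(green, feature))
    shuffled = list(feature)
    hits = 0
    for _ in range(n_perm):
        # Shuffling breaks any real link between indicator and score.
        rng.shuffle(shuffled)
        if abs(pearson(green, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

A text would be flagged when the returned p-value falls below a chosen significance level. The permutation approach is used here because it makes no distributional assumptions about the auxiliary score; the actual test in the paper may differ.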
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques
