Discovering Spoofing Attempts on Language Model Watermarks

Thibaud Gloaguen; Nikola Jovanović; Robin Staab; Martin Vechev

arXiv:2410.02693·cs.CR·May 23, 2025

TL;DR

This paper introduces a statistical detection method to identify spoofed watermarks in LLM-generated text, revealing that current spoofing techniques leave detectable artifacts and are less effective than previously believed.

Contribution

It is the first work to propose a reliable post-hoc statistical test for detecting watermark spoofing in language models, and it highlights the limitations of existing spoofing methods.

Findings

01. High detection power across all tested spoofing methods
02. Current spoofing methods leave detectable artifacts
03. Spoofing attacks are less effective than previously thought

Abstract

LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these…
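As background for the abstract, the following is a minimal sketch of how a watermark is typically verified. It assumes a KGW-style "green list" scheme (Kirchenbauer et al.), which is one common family of LLM watermarks. It is not the spoofing-detection test proposed in this paper. The hash-based `is_green` rule, the key string, `gamma = 0.25`, and the helper `all_green_sequence` are hypothetical choices made for illustration. A detector counts how many tokens fall in a secret, context-dependent green list and computes a one-sided z-score. A spoofer succeeds when it produces text that pushes this score over the threshold without ever having access to the key.

```python
import hashlib
import math
from typing import Sequence


def is_green(prev_token: int, token: int, key: str, gamma: float = 0.25) -> bool:
    """Hypothetical green-list rule: `token` is green after `prev_token`
    if a keyed hash of (key, prev_token, token) falls below gamma."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform value in [0, 1)
    return u < gamma


def watermark_z_score(tokens: Sequence[int], key: str, gamma: float = 0.25) -> float:
    """One-sided z-score of the green-token count; large values suggest a watermark."""
    n = len(tokens) - 1  # number of scored (previous, current) token pairs
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    # Under H0 (text produced without the key) each pair is green with probability gamma.
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def all_green_sequence(length: int, key: str, gamma: float = 0.25) -> list[int]:
    """Toy 'generator' (hypothetical) that always picks the smallest green token id,
    mimicking a maximally watermarked, or perfectly spoofed, text."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(10_000) if is_green(prev, t, key, gamma)))
    return tokens


if __name__ == "__main__":
    text = all_green_sequence(101, key="secret")
    print(round(watermark_z_score(text, key="secret"), 2))  # 100 pairs, all green: 17.32
```

Scoring the same tokens with a different key would make each pair green only about a quarter of the time, so the score would stay near zero on average. The original KGW work uses a threshold of z = 4. Because a well-executed spoof also clears such a threshold, telling spoofed text apart from genuine watermarked text requires a separate test, which is the gap this paper addresses.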

Code & Models

Repositories

eth-sri/watermark-spoofing-detection
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques