Watermark Smoothing Attacks against Language Models

Hongyan Chang; Hamed Hassani; Reza Shokri

arXiv:2407.14206 · cs.LG · February 6, 2025 · 1 cites

Open Access · 3 Reviews

TL;DR

This paper introduces the Smoothing Attack, a novel method that exploits the relationship between a model's confidence and watermark detectability to effectively remove watermarks from AI-generated text, revealing vulnerabilities in current watermarking schemes.

Contribution

The work presents a new watermark removal attack that leverages confidence smoothing, exposing weaknesses in existing watermarking defenses for language models.

Findings

01

The attack successfully removes watermarks across models from 1.3B to 30B parameters.

02

It works on 10 different watermark schemes, demonstrating broad applicability.

03

Existing watermarking methods are vulnerable to confidence-based smoothing attacks.

Abstract

Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model's confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from 1.3B to 30B parameters and on 10 different watermarks, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.
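The abstract does not spell out the confidence measure or the smoothing rule, so the following is only a minimal sketch of what "selectively smoothing" could look like at a single decoding step. It assumes the attacker sees the watermarked model's top-k probabilities and has a separate reference model; the max-probability confidence score, the `threshold` parameter, and all function names are hypothetical choices made for illustration, not the authors' algorithm.

```python
import random


def confidence(wm_top_probs):
    # Hypothetical confidence score: the largest probability among the
    # top-k tokens returned by the watermarked model.
    return max(wm_top_probs.values())


def normalize(dist):
    # Rescale a token -> weight mapping so the weights sum to 1.
    total = sum(dist.values())
    return {tok: p / total for tok, p in dist.items()}


def choose_distribution(wm_top_probs, ref_probs, threshold):
    # Confident step: keep the watermarked model's own distribution.
    if confidence(wm_top_probs) >= threshold:
        return normalize(wm_top_probs)
    # Uncertain step (assumption): re-score the same top-k candidates
    # with the reference model, which carries no watermark.
    restricted = {tok: ref_probs.get(tok, 0.0) for tok in wm_top_probs}
    if sum(restricted.values()) == 0.0:
        # The reference model gives no mass to any candidate; fall back.
        return normalize(wm_top_probs)
    return normalize(restricted)


def smoothing_step(wm_top_probs, ref_probs, threshold, rng):
    # Sample one token from whichever distribution was selected.
    dist = choose_distribution(wm_top_probs, ref_probs, threshold)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

For example, `smoothing_step({"a": 0.4, "b": 0.35, "c": 0.25}, {"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.6}, 0.8, random.Random(0))` falls below the threshold, so the token is drawn from the reference model's scores restricted to `a`, `b`, and `c`. Restricting to the watermarked model's candidates is one way to lean on the stronger model for quality while drawing the final choice from an unwatermarked distribution.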

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The paper proposes a heuristic to estimate which tokens contribute the most to the overall watermark signal and removes the watermark by editing these tokens using another language model. The idea is interesting, and the paper empirically validates the effectiveness of their attack across different watermarks, language models, and datasets. These results clearly establish the effectiveness of the attack in practice.

Weaknesses

The paper distinguishes its main contributions from prior work by arguing that prior work on automatically removing watermarks involved using language models that were at least as strong as the original watermarked language model. However, one notable exception is the work of Zhang et al. [1], who seem to also focus on removing watermarks using weaker language models. This work is cited in the present paper but not discussed in any detail. It would be great if the authors can update their paper

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- I find the proposed method very interesting and quite different from the previous work. Meanwhile, the method doesn't require a strong oracle model like a paraphrasing attack, which makes the threat model more realistic.
- I really enjoy reading this paper, especially section 3.1, which gives readers a lot of insights.
- The results look positive and a lot of different watermarking schemes are covered (most results are presented in the appendix).

Weaknesses

- The proposed method relies on using the logits/output probabilities of the watermarked model. This might limit the attack to some API models that may not return the logits/probabilities or only return top-k probabilities or even calibrated probabilities.
- The paper uses perplexity or loss to measure the text quality, but I think it's not enough to show the quality of the text. For example, the model can generate an answer for a math question with a very low perplexity, but the answer is compl

Reviewer 03 · Rating 3 · Confidence 3

Strengths

Many existing methods for statistical watermarking have primarily concentrated on the generation and detection of watermarks. This paper takes a different approach by examining statistical watermarking from a new perspective. This perspective is interesting and may also aid in the development of improved watermark generation and detection techniques.

Weaknesses

1. The significance level $S_t$ is unobserved and was estimated using a surrogate quantity, $c_t$. Though the authors showed that there is generally a negative correlation between $c_t$ and $S_t$, this is only a weak justification. It is possible that a small $c_t$ would correspond to a large $S_t$ in some situations, e.g., when $K$ is small.
2. The method only applies to the “green-red list” watermarking scheme, which is known to be biased because it does not preserve the original text distrib
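The link between confidence and watermark signal that point 1 questions can be illustrated with a toy calculation. This is a rough sketch that assumes an additive-logit green-red scheme, where a bias `delta` is added to the logits of green-list tokens; the review does not define $S_t$ or $c_t$ precisely, so the "green mass shift" below is a hypothetical stand-in for per-token watermark signal, and every name is illustrative.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def green_mass_shift(logits, green_ids, delta):
    # Probability mass on green tokens before and after adding delta
    # to their logits; the difference is how far the bias can move
    # the next-token distribution at this step.
    base = softmax(logits)
    biased = softmax(
        [x + delta if i in green_ids else x for i, x in enumerate(logits)]
    )
    before = sum(base[i] for i in green_ids)
    after = sum(biased[i] for i in green_ids)
    return after - before


green = {1, 2}                   # hypothetical green list for this step
peaked = [10.0, 0.0, 0.0, 0.0]   # confident model, top token is red
flat = [0.0, 0.0, 0.0, 0.0]      # uncertain model

shift_peaked = green_mass_shift(peaked, green, delta=2.0)
shift_flat = green_mass_shift(flat, green, delta=2.0)
```

In this toy setting the flat distribution moves far more mass onto green tokens than the peaked one, which is the negative correlation the review refers to. It shows a tendency only; it does not settle whether a confidence score is a reliable surrogate in every case, which is the reviewer's concern.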

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection