Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

Yixin Cheng; Hongcheng Guo; Yangming Li; Leonid Sigal

arXiv:2505.05190·cs.LG·May 13, 2025

TL;DR

This paper exposes a vulnerability in current text watermarking techniques by introducing SIRA, a paraphrasing attack that nearly always breaks watermarks without needing access to the watermarking process.

Contribution

The paper presents SIRA, a novel, efficient attack that exploits the common practice of embedding watermarks in high-entropy tokens, revealing a critical weakness in existing watermarking algorithms.

Findings

01

SIRA achieves a nearly 100% attack success rate against recent watermarking methods.

02

The attack costs less than 1 USD per million tokens.

03

SIRA works without access to watermark algorithms or watermarked LLMs.

Abstract

Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages the vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform a targeted attack. Our work exposes a widely prevalent vulnerability in current…
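
The token-selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows how the self-information of a token, I(x) = -log2 p(x | context), could be used to flag candidate "pattern" tokens. The probabilities below are made-up values standing in for what a language model would assign, and the names `mask_high_information`, `top_fraction`, and `placeholder` are hypothetical choices made for this example.

```python
import math


def self_information(probs):
    """Return the self-information -log2(p), in bits, for each probability."""
    return [-math.log2(p) for p in probs]


def mask_high_information(tokens, probs, top_fraction=0.3, placeholder="[MASK]"):
    """Replace the highest self-information tokens with a placeholder.

    top_fraction and placeholder are illustrative assumptions, not values
    taken from the paper.
    """
    info = self_information(probs)
    # Number of tokens to flag; always flag at least one.
    k = max(1, round(len(tokens) * top_fraction))
    # Positions sorted from most to least surprising; keep the first k.
    ranked = sorted(range(len(tokens)), key=lambda i: info[i], reverse=True)
    flagged = set(ranked[:k])
    return [placeholder if i in flagged else tok for i, tok in enumerate(tokens)]


# Toy input: probabilities are invented for illustration only.
tokens = ["The", "cat", "sat", "on", "the", "velvet", "ottoman"]
probs = [0.5, 0.25, 0.5, 0.5, 0.5, 0.0625, 0.03125]

masked = mask_high_information(tokens, probs, top_fraction=0.3)
print(masked)
```

In this toy input the two least probable tokens carry the most self-information (4 and 5 bits), so they are the ones replaced. A paraphrasing step would then regenerate the flagged positions; that step is not shown here.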

Code & Models

Repositories

allencheng97/self-information-rewrite-attack
PyTorch · Official

Videos

Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks · SlidesLive

Taxonomy

Topics: Advanced Steganography and Watermarking Techniques · Adversarial Robustness in Machine Learning · Internet Traffic Analysis and Secure E-voting