Breaking Distortion-free Watermarks in Large Language Models

Shayleen Reynolds; Hengzhi He; Dung Daniel T. Ngo; Saheed Obitayo; Niccolò Dalmasso; Guang Cheng; Vamsi K. Potluru; Manuela Veloso

arXiv:2502.18608 · cs.CR · June 13, 2025

Open Access

TL;DR

This paper demonstrates that even distortion-free watermarking schemes for large language models can be reverse-engineered and spoofed using adaptive prompting and sorting algorithms, challenging their assumed robustness.

Contribution

The paper introduces a method for reverse-engineering distortion-free watermarks in LLMs, exposing vulnerabilities in current watermarking schemes.

Findings

01. Recovered watermark keys from multiple LLMs

02. Generated large amounts of watermarked text attributable to the target LLM (a spoofing attack)

03. Challenged the robustness claims of existing watermarking techniques

Abstract

In recent years, LLM watermarking has emerged as an attractive safeguard against AI-generated content, with promising applications in many real-world domains. However, there are growing concerns that the current LLM watermarking schemes are vulnerable to expert adversaries wishing to reverse-engineer the watermarking mechanisms. Prior work in breaking or stealing LLM watermarks mainly focuses on the distribution-modifying algorithm of Kirchenbauer et al. (2023), which perturbs the logit vector before sampling. In this work, we focus on reverse-engineering the other prominent LLM watermarking scheme, distortion-free watermarking (Kuditipudi et al. 2024), which preserves the underlying token distribution by using a hidden watermarking key sequence. We demonstrate that, even under a more sophisticated watermarking scheme, it is possible to compromise the LLM and carry out a spoofing…
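
The abstract contrasts two families of schemes. The sketch below is a simplified illustration of that contrast, not the implementation from either cited paper: `green_list_bias` perturbs logits in the spirit of Kirchenbauer et al. (2023), and `keyed_inverse_transform` picks a token by inverse-CDF sampling driven by a key element, in the spirit of distortion-free watermarking. All function names, the parameters `gamma` and `delta`, and the seeding by the previous token id are assumptions made for illustration.

```python
import random


def green_list_bias(logits, prev_token, gamma=0.5, delta=2.0):
    """Distribution-modifying sketch: boost a pseudo-random 'green' subset of logits."""
    vocab_size = len(logits)
    rng = random.Random(prev_token)  # assumption: partition seeded by previous token id
    green = set(rng.sample(range(vocab_size), int(gamma * vocab_size)))
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def keyed_inverse_transform(probs, u, perm):
    """Distortion-free sketch: inverse-CDF sampling driven by a key element (u, perm)."""
    cumulative = 0.0
    for token in perm:  # walk the vocabulary in the key's secret order
        cumulative += probs[token]
        if u < cumulative:
            return token
    return perm[-1]  # guard against floating-point shortfall


def token_frequencies(probs, perm, n=1000):
    """Token frequencies when u sweeps an even grid over [0, 1)."""
    counts = [0] * len(probs)
    for k in range(n):
        counts[keyed_inverse_transform(probs, (k + 0.5) / n, perm)] += 1
    return [c / n for c in counts]
```

In this toy setting, `green_list_bias` changes the distribution the model samples from, whereas `token_frequencies` shows that averaging `keyed_inverse_transform` over uniformly spread values of `u` reproduces `probs`. For a fixed key the chosen token is deterministic, which is the correlation a detector holding the key can test for, and which an attacker who recovers the key could reuse.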

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Generative Adversarial Networks and Image Synthesis
