Breaking Distortion-free Watermarks in Large Language Models
Shayleen Reynolds, Hengzhi He, Dung Daniel T. Ngo, Saheed Obitayo, Niccolò Dalmasso, Guang Cheng, Vamsi K. Potluru, Manuela Veloso

TL;DR
This paper demonstrates that even distortion-free watermarking schemes for large language models can be reverse-engineered and spoofed using adaptive prompting and sorting algorithms, challenging their assumed robustness.
Contribution
It introduces a novel method to reverse-engineer distortion-free watermarks in LLMs, revealing vulnerabilities in current watermarking schemes.
Findings
Successfully recovered watermark keys from multiple LLMs
Generated large amounts of spoofed text that would be attributed to the watermarked LLM
Challenged the robustness claims of existing watermarking techniques
Abstract
In recent years, LLM watermarking has emerged as an attractive safeguard against AI-generated content, with promising applications in many real-world domains. However, there are growing concerns that the current LLM watermarking schemes are vulnerable to expert adversaries wishing to reverse-engineer the watermarking mechanisms. Prior work in breaking or stealing LLM watermarks mainly focuses on the distribution-modifying algorithm of Kirchenbauer et al. (2023), which perturbs the logit vector before sampling. In this work, we focus on reverse-engineering the other prominent LLM watermarking scheme, distortion-free watermarking (Kuditipudi et al. 2024), which preserves the underlying token distribution by using a hidden watermarking key sequence. We demonstrate that, even under a more sophisticated watermarking scheme, it is possible to compromise the LLM and carry out a spoofing…
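To make the "distortion-free" idea in the abstract concrete, here is a minimal sketch of how a hidden key sequence can drive sampling without changing the token distribution. It assumes exponential-minimum sampling, one of the samplers described by Kuditipudi et al.; the function names (`make_key`, `sample_with_key`, `empirical_distribution`), the toy vocabulary, and the seed are hypothetical and chosen for illustration only. It is not the attack or the code from this paper.

```python
import math
import random


def make_key(length, vocab_size, seed):
    """Hypothetical secret key: one row of uniforms in (0, 1] per position."""
    rng = random.Random(seed)
    return [[1.0 - rng.random() for _ in range(vocab_size)] for _ in range(length)]


def sample_with_key(probs, key_row):
    """Pick argmax_i u_i ** (1 / p_i), computed in log space for stability.

    For a fixed key row the choice is deterministic; averaged over random
    key rows, token i is chosen with probability p_i.
    """
    best_token, best_score = None, -math.inf
    for token, (p, u) in enumerate(zip(probs, key_row)):
        if p <= 0.0:
            continue  # zero-probability tokens can never be selected
        score = math.log(u) / p
        if score > best_score:
            best_token, best_score = token, score
    return best_token


def empirical_distribution(probs, num_keys=20000, seed=0):
    """Estimate the sampling distribution by drawing many fresh key rows."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(num_keys):
        key_row = [1.0 - rng.random() for _ in probs]
        counts[sample_with_key(probs, key_row)] += 1
    return [c / num_keys for c in counts]


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.3, 0.4]  # toy next-token distribution
    key = make_key(length=5, vocab_size=len(probs), seed=42)
    print([sample_with_key(probs, row) for row in key])
    print(empirical_distribution(probs))
```

Because each token is a deterministic function of the key row and the model's probabilities, anyone who recovers the key sequence could reproduce the watermark, which is the kind of exposure the abstract refers to.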
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Generative Adversarial Networks and Image Synthesis
