Optimizing Adaptive Attacks against Watermarks for Language Models
Abdulrahman Diaa, Toluwani Aremu, Nils Lukas

TL;DR
This paper demonstrates that adaptive, optimization-based attacks can effectively evade watermark detection in large language models, highlighting the need for more robust watermarking methods.
Contribution
The paper introduces a preference-based optimization approach for tuning adaptive attacks against LLM watermarking methods, and uses it to expose their vulnerabilities.
Findings
Adaptive attacks evade all surveyed watermarks
Training against one watermark evades others
Optimization-based attacks are cost-effective
Abstract
Large Language Models (LLMs) can be misused to spread unwanted content at scale. Content watermarking deters misuse by hiding messages in content, enabling its detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate watermark robustness as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against any watermark succeeds in evading unseen watermarks, and (iii) optimization-based…
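The abstract's idea of treating robustness as an objective and tuning an attack with preference-based optimization can be illustrated with a minimal sketch. It is not the authors' implementation. The `Candidate` fields, the `min_quality` threshold, and the helper names are hypothetical, and the detection flag and quality score are assumed to come from some surrogate detector and quality judge that are not shown here. The sketch scores each candidate paraphrase, then pairs successful evasions (preferred) with failures (rejected) so the pairs could feed a preference optimizer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One paraphrase of a watermarked text (all fields are illustrative)."""
    text: str
    detected: bool   # assumed: did a surrogate watermark detector flag it?
    quality: float   # assumed: quality score in [0, 1] from some judge


def robustness_objective(c: Candidate, min_quality: float = 0.7) -> float:
    """Attacker reward: 1.0 if the text evades detection AND keeps quality."""
    return 1.0 if (not c.detected and c.quality >= min_quality) else 0.0


def build_preference_pairs(
    prompt: str, candidates: List[Candidate], min_quality: float = 0.7
) -> List[Tuple[str, str, str]]:
    """Return (prompt, preferred, rejected) triples for preference training."""
    winners = [c for c in candidates if robustness_objective(c, min_quality) == 1.0]
    losers = [c for c in candidates if robustness_objective(c, min_quality) == 0.0]
    # Every successful evasion is preferred over every failed attempt.
    return [(prompt, w.text, l.text) for w in winners for l in losers]


if __name__ == "__main__":
    samples = [
        Candidate("paraphrase A", detected=True, quality=0.9),   # still detected
        Candidate("paraphrase B", detected=False, quality=0.9),  # evades, good quality
        Candidate("paraphrase C", detected=False, quality=0.2),  # evades, poor quality
    ]
    for pair in build_preference_pairs("Paraphrase this text.", samples):
        print(pair)
```

A binary reward keeps the example short. A graded reward that trades detection confidence against quality would fit the same pairing scheme.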
Code & Models
- 🤗 DDiaa/WM-Removal-EXP-Qwen2.5-0.5B (model, 1 download)
- 🤗 DDiaa/WM-Removal-EXP-Qwen2.5-1.5B (model, 2 downloads)
- 🤗 DDiaa/WM-Removal-EXP-Qwen2.5-3B (model, 1 download)
- 🤗 DDiaa/WM-Removal-EXP-Qwen2.5-7B (model, 2 downloads)
- 🤗 DDiaa/WM-Removal-EXP-Llama-2-7B (model, 1 download)
- 🤗 DDiaa/WM-Removal-Unigram-Llama-3.2-3B (model, 1 download)
- 🤗 DDiaa/WM-Removal-Unigram-Qwen2.5-3B (model, 5 downloads)
- 🤗 DDiaa/WM-Removal-KGW-Llama-3.1-8B (model, 16 downloads)
- 🤗 DDiaa/WM-Removal-KGW-Llama-2-7B (model, 25 downloads)
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
