Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs
Jonathan Hong Jin Ng, Anh Tu Ngo, Anupam Chattopadhyay

TL;DR
This paper evaluates how robust watermarking techniques for large language models are against a range of semantic-preserving attacks, exposing vulnerabilities and pointing to ways of hardening the schemes.
Contribution
It provides a comprehensive analysis of watermarking schemes' effectiveness against diverse attack strategies, highlighting their weaknesses and suggesting directions for enhancement.
Findings
Watermark removal is feasible with reasonable effort.
Effectiveness varies among different watermarking models.
Attacks that remove the watermark can also degrade the text, so semantic preservation has to be measured alongside removal success.
Abstract
In this paper, we investigate the recent state-of-the-art schemes for watermarking large language model (LLM) outputs. These techniques are claimed to be robust, scalable and production-grade, aimed at promoting responsible usage of LLMs. We analyse the effectiveness of these watermarking techniques against an extensive collection of modified text attacks, which perform targeted semantic changes without altering the general meaning of the text content. Our approach encompasses multiple attack strategies, which include lexical alterations, machine translation, and even neural paraphrasing. The attack efficacy is measured with two target criteria: successful removal of the watermark and preservation of semantic content. We evaluate semantic preservation through BERT scores, text complexity measures, grammatical errors, and Flesch Reading Ease indices. The experimental results reveal…
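Of the semantic-preservation measures listed in the abstract, the Flesch Reading Ease index is the simplest to reproduce. The sketch below computes it with the standard formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The function names, the regex-based tokenisation, and the vowel-group syllable counter are assumptions made for illustration; the abstract does not say which tooling the authors used, and a dictionary-based syllable counter would give somewhat different scores.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not the paper's method):
    count runs of consecutive vowels, with a minimum of one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease formula; higher means easier to read."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenisation: alphabetic runs, apostrophes allowed.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def readability_shift(original: str, attacked: str) -> float:
    """Hypothetical helper: change in score after an attack.
    Positive means the attacked text became easier to read."""
    return flesch_reading_ease(attacked) - flesch_reading_ease(original)
```

Comparing the score of a watermarked text with the score of its attacked version, as `readability_shift` does, gives one rough signal of how much an attack changed the text's complexity. It says nothing about meaning on its own, which is presumably why the abstract pairs it with BERT scores and grammatical-error counts.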
