Ensemble Watermarks for Large Language Models
Georg Niess, Roman Kern

TL;DR
This paper introduces an ensemble watermarking technique for large language models that combines multiple features to improve detection accuracy and robustness against paraphrasing attacks, enhancing accountability.
Contribution
It proposes a novel multi-feature ensemble watermarking method that significantly improves detection rates and robustness over existing single-feature watermarks for LLMs.
Findings
98% detection rate with the ensemble watermark (acrostica, sensorimotor norms, and red-green combined)
95% detection rate after a paraphrasing attack
The red-green baseline alone reaches only a 49% detection rate after paraphrasing
Abstract
As large language models (LLMs) reach human-like fluency, reliably distinguishing AI-generated text from human authorship becomes increasingly difficult. While watermarks already exist for LLMs, they often lack flexibility and struggle with attacks such as paraphrasing. To address these issues, we propose a multi-feature method for generating watermarks that combines multiple distinct watermark features into an ensemble watermark. Concretely, we combine acrostica and sensorimotor norms with the established red-green watermark to achieve a 98% detection rate. After a paraphrasing attack, performance remains high, with a 95% detection rate. In comparison, the red-green feature alone, used as a baseline, achieves a detection rate of 49% after paraphrasing. The evaluation of all feature combinations reveals that the ensemble of all three consistently has the highest detection rate across several…
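To make the idea of combining watermark features more concrete, here is a minimal sketch of how detection evidence from several features could be pooled. It is not the authors' implementation: the hash-based green list in `is_green`, the whitespace-level tokens, the helper names, and the use of Stouffer's method to merge per-feature z-scores are all assumptions chosen for illustration, and the abstract does not state how the paper actually combines its features.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of tokens treated as "green" (hypothetical value)


def is_green(prev_token: str, token: str, gamma: float = GAMMA) -> bool:
    """Toy green-list test: hash the (previous token, token) pair.

    Assumption: a real red-green watermark partitions the model's
    vocabulary using a keyed seed; a plain SHA-256 hash stands in here.
    """
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < int(gamma * 2**64)


def binomial_z(hits: int, trials: int, p_null: float) -> float:
    """z-score of `hits` successes in `trials` against null rate `p_null`."""
    if trials == 0:
        return 0.0
    return (hits - p_null * trials) / math.sqrt(trials * p_null * (1 - p_null))


def red_green_z_score(tokens: list[str], gamma: float = GAMMA) -> float:
    """Count green tokens (each judged against its predecessor) and score them."""
    pairs = list(zip(tokens, tokens[1:]))
    green = sum(is_green(prev, tok, gamma) for prev, tok in pairs)
    return binomial_z(green, len(pairs), gamma)


def stouffer_combine(z_scores: list[float]) -> float:
    """Merge per-feature z-scores with Stouffer's method (an assumed choice).

    This treats the feature scores as independent, which may not hold.
    """
    return sum(z_scores) / math.sqrt(len(z_scores))
```

In this sketch, other features (for example, an acrostic check on sentence-initial letters) would each yield their own `binomial_z` score under a suitable null rate, and a text would be flagged when the combined score exceeds a chosen threshold. The intuition is that a paraphrase which weakens one feature may still leave enough signal in the others.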
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
