New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking
Karanpartap Singh, James Zou

TL;DR
This paper introduces two new evaluation methods for LLM watermarking. They show that current techniques can be detected by simple classifiers and can degrade text quality, which underscores the need for better metrics.
Contribution
The paper proposes two novel evaluation approaches for LLM watermarking, an LLM judge that follows specific guidelines and a binary classifier on text embeddings, to assess how detectable watermarks are and how they affect text quality.
Findings
Current watermarking methods are detectable by even simple classifiers.
Watermarking degrades the coherence and depth of generated text.
Existing metrics may overlook important limitations of watermarking.
Abstract
With the increasing use of large language models (LLMs) like ChatGPT, watermarking has emerged as a promising approach for tracing machine-generated content. However, research on LLM watermarking often relies on simple perplexity or diversity-based measures to assess the quality of watermarked text, which can mask important limitations in watermarking. Here we introduce two new easy-to-use methods for evaluating watermarking algorithms for LLMs: 1) evaluation by LLM-judger with specific guidelines; and 2) binary classification on text embeddings to distinguish between watermarked and unwatermarked text. We apply these methods to characterize the effectiveness of current watermarking techniques. Our experiments, conducted across various datasets, reveal that current watermarking methods are detectable by even simple classifiers, challenging the notion of watermarking subtlety. We also…
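The second evaluation method in the abstract, binary classification on text embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which embedding model or classifier the paper uses, so the example substitutes synthetic Gaussian vectors for real embeddings and a plain logistic-regression probe trained by gradient descent. The function names, the `shift` parameter, and all hyperparameters are hypothetical.

```python
import numpy as np


def synthetic_embeddings(n_per_class=200, dim=16, shift=0.5, seed=0):
    """Stand-in data (an assumption): Gaussian vectors replace real text embeddings.

    `shift` is a hypothetical knob for how far watermarking moves the
    embedding distribution; shift=0 would mean no detectable difference.
    """
    rng = np.random.default_rng(seed)
    unwatermarked = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    watermarked = rng.normal(shift, 1.0, size=(n_per_class, dim))
    X = np.vstack([unwatermarked, watermarked])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    perm = rng.permutation(len(y))  # shuffle so the split mixes both classes
    return X[perm], y[perm]


def train_logistic_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe with batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(watermarked)
        w -= lr * (X.T @ (p - y) / n)           # gradient of mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)                # gradient of mean log-loss w.r.t. b
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose predicted label matches the true label."""
    preds = (X @ w + b > 0).astype(int)
    return float(np.mean(preds == y))


X, y = synthetic_embeddings()
split = int(0.8 * len(y))  # 80/20 train/test split
w, b = train_logistic_probe(X[:split], y[:split])
test_acc = probe_accuracy(X[split:], y[split:], w, b)
print(f"held-out accuracy: {test_acc:.2f}")
```

Held-out accuracy well above 0.5 would indicate that the two sets of embeddings are separable, which is the sense in which the abstract calls watermarked text detectable. With real data, the synthetic generator would be replaced by embeddings of watermarked and unwatermarked generations for the same prompts.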
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Artificial Intelligence in Healthcare and Education
