Modification and Generated-Text Detection: Achieving Dual Detection Capabilities for the Outputs of LLM by Watermark
Yuhang Cai, Yaofei Wang, Donghui Hu, Chen Gu

TL;DR
This paper presents a novel watermarking technique for LLM outputs that enables simultaneous detection of modifications and generated text, enhancing the security and integrity of language model outputs.
Contribution
The authors introduce a new unbiased watermark method and a 'discarded tokens' metric to detect both modifications and generated text in LLM outputs, addressing limitations of existing methods.
Findings
Effective dual detection of modifications and generated text achieved
New metric 'discarded tokens' correlates with modifications
Improved watermark detection robustness demonstrated
Abstract
The development of large language models (LLMs) has raised concerns about potential misuse. One practical solution is to embed a watermark in the text, allowing ownership verification through watermark extraction. Existing methods primarily focus on defending against modification attacks, often neglecting other spoofing attacks. For example, attackers can alter the watermarked text to produce harmful content without compromising the presence of the watermark, which could lead to false attribution of this malicious content to the LLM. This situation poses a serious threat to LLM service providers and highlights the significance of achieving modification detection and generated-text detection simultaneously. Therefore, we propose a technique to detect modifications in text for an unbiased watermark that is sensitive to modification. We introduce a new metric called "discarded tokens",…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Spam and Phishing Detection · Authorship Attribution and Profiling
