A Nested Watermark for Large Language Models
Koichi Nagatsuka, Terufumi Morishita, Yasuhiro Sogawa

TL;DR
This paper introduces a nested watermarking technique for large language models that embeds two distinct watermarks using two independent keys, so generated text stays traceable even if one key leaks, without compromising text quality.
Contribution
The paper proposes a novel nested watermarking scheme with two independent keys, improving authorship attribution and robustness over existing single-key methods.
Findings
High detection accuracy for both watermarks
Maintains text fluency and quality
Robust against key leakage scenarios
Abstract
The rapid advancement of large language models (LLMs) has raised concerns regarding their potential misuse, particularly in generating fake news and misinformation. To address these risks, watermarking techniques for autoregressive language models have emerged as a promising means for detecting LLM-generated text. Existing methods typically embed a watermark by increasing the probabilities of tokens within a group selected according to a single secret key. However, this approach suffers from a critical limitation: if the key is leaked, it becomes impossible to trace the text's provenance or attribute authorship. To overcome this vulnerability, we propose a novel nested watermarking scheme that embeds two distinct watermarks into the generated text using two independent keys. This design enables reliable authorship identification even in the event that one key is compromised.…
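The key-based token biasing described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the excerpt does not say how the two watermarks are nested, so the code simply assumes two independent key-derived token groups whose logit boosts are added together. All function names, the uniform toy "model", the keys, and the parameters (`delta`, `fraction`) are hypothetical choices made for illustration.

```python
import hashlib
import math
import random

def token_group(key: bytes, prev_token: int, vocab_size: int, fraction: float = 0.5) -> set:
    """Pick a key-dependent group of tokens, reseeded by the previous token."""
    seed = hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()
    return set(random.Random(seed).sample(range(vocab_size), int(vocab_size * fraction)))

def watermark_logits(logits, prev_token, keys, delta=2.0):
    """Add delta to a token's logit once per key whose group contains it."""
    biased = list(logits)
    for key in keys:
        for tok in token_group(key, prev_token, len(logits)):
            biased[tok] += delta
    return biased

def generate(length, vocab_size, keys, rng):
    """Sample from a toy uniform 'model' (assumption) with the bias applied."""
    tokens = [0]
    for _ in range(length):
        logits = watermark_logits([0.0] * vocab_size, tokens[-1], keys)
        weights = [math.exp(x) for x in logits]
        tokens.append(rng.choices(range(vocab_size), weights=weights)[0])
    return tokens

def z_score(tokens, key, vocab_size, fraction=0.5):
    """Standardized count of tokens that fall in the group derived from key."""
    n = len(tokens) - 1
    hits = sum(
        tokens[i] in token_group(key, tokens[i - 1], vocab_size, fraction)
        for i in range(1, len(tokens))
    )
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))

VOCAB = 100
KEY_A, KEY_B = b"key-a (hypothetical)", b"key-b (hypothetical)"
text = generate(300, VOCAB, [KEY_A, KEY_B], random.Random(0))

# Each key detects its own watermark without knowledge of the other key.
z_a = z_score(text, KEY_A, VOCAB)
z_b = z_score(text, KEY_B, VOCAB)
z_other = z_score(text, b"unrelated key", VOCAB)
print(z_a, z_b, z_other)
```

In this toy setup, detection with `KEY_B` does not depend on `KEY_A`, which mirrors the motivation stated in the abstract: if one key is compromised, the other can still be used to attribute the text. How the paper's scheme achieves this in practice may differ from the additive bias assumed here.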
