Provably Robust Watermarks for Open-Source Language Models
Miranda Christ, Sam Gunn, Tal Malkin, Mariana Raykova

TL;DR
This paper introduces a novel watermarking scheme for open-source language models that modifies model parameters for detection, proving robustness against various attacks and enabling identification without secret model details.
Contribution
It presents the first parameter-based watermarking method for open-source LLMs that remains detectable from outputs and is proven unremovable under certain assumptions.
Findings
Robustness to token substitution and parameter perturbation attacks
Watermarks detectable from outputs without secret model info
Attackers need to significantly degrade model quality to bypass detection
Abstract
The recent explosion of high-quality language models has necessitated new methods for identifying AI-generated text. Watermarking is a leading solution and could prove to be an essential tool in the age of generative AI. Existing approaches embed watermarks at inference and crucially rely on the large language model (LLM) specification and parameters being secret, which makes them inapplicable to the open-source setting. In this work, we introduce the first watermarking scheme for open-source LLMs. Our scheme works by modifying the parameters of the model, but the watermark can be detected from just the outputs of the model. Perhaps surprisingly, we prove that our watermarks are unremovable under certain assumptions about the adversary's knowledge. To demonstrate the behavior of our construction under concrete parameter instantiations, we present experimental results with OPT-6.7B and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
