A Certified Robust Watermark For Large Language Models
Xianheng Feng, Jian Liu, Kui Ren, Chun Chen

TL;DR
This paper introduces a novel certified robust watermarking method for large language models that uses randomized smoothing to provide provable guarantees against various attacks while maintaining detection performance comparable to baseline watermarks.
Contribution
We propose the first certified robust watermark algorithm for large language models based on randomized smoothing, enhancing robustness with provable guarantees.
Findings
Achieves comparable performance to baseline watermark algorithms.
Provides substantial certified robustness against removal attacks.
Demonstrates effectiveness through comprehensive empirical evaluations.
Abstract
The effectiveness of watermark algorithms in identifying AI-generated text has garnered significant attention. Concurrently, a growing number of watermark algorithms have been proposed to improve robustness against various watermark attacks. However, these algorithms remain susceptible to adaptive or unseen attacks. To address this issue, we propose, to the best of our knowledge, the first certified robust watermark algorithm for large language models based on randomized smoothing, which provides provable guarantees for watermarked text. Specifically, we use two different models for watermark generation and detection, respectively, and add Gaussian noise in the embedding space and uniform noise in the permutation space during both the training and inference stages of the watermark detector to enhance the certified robustness of our watermark detector and derive certified…
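The abstract rests on randomized smoothing. The sketch below is a minimal illustration of the generic Gaussian-smoothing certificate, not the authors' method. It assumes a binary detector that maps a continuous embedding vector to 0 (human) or 1 (watermarked). It covers only the Gaussian/embedding side and does not model the uniform noise in permutation space or the paper's own bounds, which the truncated abstract does not state. The smoothed detector returns the majority vote under noise N(0, σ²I). If the lower confidence bound p_lower on the majority-vote probability exceeds 1/2, the vote cannot change for any L2 perturbation of the embedding smaller than σ·Φ⁻¹(p_lower). The names `certify`, `_count_votes`, and `mean_detector` are hypothetical. A one-sided Hoeffding bound stands in for the Clopper-Pearson interval commonly used, so the code stays standard-library only.

```python
import math
import random
from statistics import NormalDist
from typing import Callable, List, Optional, Tuple

# Hypothetical detector type: embedding -> 0 (human) or 1 (watermarked).
Detector = Callable[[List[float]], int]


def _count_votes(detector: Detector, embedding: List[float], sigma: float,
                 num: int, rng: random.Random) -> List[int]:
    """Count detector outputs on `num` Gaussian-perturbed copies of the embedding."""
    votes = [0, 0]
    for _ in range(num):
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        votes[detector(noisy)] += 1
    return votes


def certify(detector: Detector, embedding: List[float], sigma: float,
            n0: int = 100, n: int = 1000, alpha: float = 0.001,
            seed: int = 0) -> Tuple[Optional[int], float]:
    """Return (label, certified L2 radius), or (None, 0.0) to abstain."""
    rng = random.Random(seed)
    # Stage 1: pick the likely majority class from a separate small sample.
    selection = _count_votes(detector, embedding, sigma, n0, rng)
    top = 0 if selection[0] >= selection[1] else 1
    # Stage 2: estimate P[f(x + eps) = top] on fresh samples.
    estimation = _count_votes(detector, embedding, sigma, n, rng)
    p_hat = estimation[top] / n
    # One-sided Hoeffding lower bound, valid with probability >= 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # no certificate can be issued
    return top, sigma * NormalDist().inv_cdf(p_lower)


def mean_detector(v: List[float]) -> int:
    """Toy stand-in detector: flag as watermarked if the mean coordinate is positive."""
    return 1 if sum(v) / len(v) > 0 else 0


label, radius = certify(mean_detector, [1.0] * 16, sigma=0.5)
print(label, round(radius, 3))
```

Larger σ gives larger radii for confidently classified inputs but makes the noisy votes harder to keep consistent. Training the detector under the same noise, as the abstract describes, is what keeps p_lower high in practice.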
