Watermarks for Language Models via Probabilistic Automata
Yangkun Wang, Jingbo Shang

TL;DR
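This paper introduces a class of watermarking schemes for language models built on probabilistic automata: a practical variant with exponential generation diversity and computational efficiency, and a theoretical variant with formal undetectability guarantees under cryptographic assumptions. Experiments on LLaMA-3B and Mistral-7B support the scheme's robustness and efficiency.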
Contribution
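It presents a new class of watermarking schemes constructed through probabilistic automata, with a practical variant that addresses the limited generation diversity and high detection overhead of a recent distortion-free scheme, and a theoretical variant that provides formal undetectability guarantees.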
Findings
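Practical scheme achieves exponential generation diversity
Experiments on LLaMA-3B and Mistral-7B show superior robustness and efficiency
Theoretical construction offers formal undetectability guarantees under cryptographic assumptions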
Abstract
A recent watermarking scheme for language models achieves distortion-free embedding and robustness to edit-distance attacks. However, it suffers from limited generation diversity and high detection overhead. In parallel, recent research has focused on undetectability, a property ensuring that watermarks remain difficult for adversaries to detect and spoof. In this work, we introduce a new class of watermarking schemes constructed through probabilistic automata. We present two instantiations: (i) a practical scheme with exponential generation diversity and computational efficiency, and (ii) a theoretical construction with formal undetectability guarantees under cryptographic assumptions. Extensive experiments on LLaMA-3B and Mistral-7B validate the superior performance of our scheme in terms of robustness and efficiency.
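For readers unfamiliar with the term, a probabilistic automaton is a finite-state machine whose next transition is drawn at random from a distribution attached to the current state. The sketch below is a generic illustration of that idea only; it is not the paper's watermarking construction, and the state names, the `TRANSITIONS` table, and the `sample_path` helper are all hypothetical.

```python
import random

# Hypothetical two-state automaton (an assumption for illustration).
# Each state maps to a list of (next_state, emitted_symbol, probability).
TRANSITIONS = {
    "q0": [("q0", "a", 0.5), ("q1", "b", 0.5)],
    "q1": [("q0", "a", 0.25), ("q1", "b", 0.75)],
}


def sample_path(start, steps, rng):
    """Walk the automaton for `steps` transitions and return the emitted symbols."""
    state = start
    emitted = []
    for _ in range(steps):
        choices = TRANSITIONS[state]
        weights = [prob for _, _, prob in choices]
        # Draw one outgoing transition according to its probability.
        state, symbol, _ = rng.choices(choices, weights=weights, k=1)[0]
        emitted.append(symbol)
    return "".join(emitted)


# A seeded generator makes the random walk reproducible.
word = sample_path("q0", 8, random.Random(0))
print(word)
```

Seeding the generator makes the walk reproducible, which is the generic property a keyed scheme would rely on: the same seed yields the same path, while different seeds yield different paths.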
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Formal Methods in Verification · Machine Learning and Algorithms
