HATS: High-Accuracy Triple-Set Watermarking for Large Language Models
Zhiqing Hu, Chenxu Zhao, Jiazhong Lu, Xiaolei Liu

TL;DR
This paper introduces HATS, a watermarking method for large language models that partitions the vocabulary into three sets (Green/Yellow/Red) at each decoding step, enabling high-accuracy detection of watermarked text with minimal impact on readability.
Contribution
HATS presents a novel triple-partition vocabulary scheme for watermarking LLM outputs, improving detection accuracy and maintaining text quality compared to prior methods.
Findings
High detection accuracy at fixed false positive rates
Preserves readability of generated text
Implemented and evaluated on Llama 2 7B
Abstract
Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher's method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.
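The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the seeding rule (hashing a secret key with the previous token), the ratios 0.5/0.25/0.25, the function names `partition`, `sample_watermarked`, and `detect`, and the toy uniform next-token distribution are all assumptions made for illustration. The sketch also treats the two p-values as independent when applying Fisher's method, which holds only approximately because the Green and Red counts come from the same tokens.

```python
import hashlib
import math
import random

GREEN, YELLOW, RED = 0, 1, 2
RATIOS = (0.5, 0.25, 0.25)  # assumed Green/Yellow/Red ratios, not from the paper


def partition(prev_token, vocab_size, key=42, ratios=RATIOS):
    """Label every token id GREEN/YELLOW/RED, seeded by (key, prev_token).

    Assumption: the partition depends only on a secret key and the previous
    token, so the detector can replay it without access to the model.
    """
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    perm = list(range(vocab_size))
    random.Random(int.from_bytes(digest[:8], "big")).shuffle(perm)
    n_green = int(ratios[0] * vocab_size)
    n_yellow = int(ratios[1] * vocab_size)
    labels = [RED] * vocab_size
    for tok in perm[:n_green]:
        labels[tok] = GREEN
    for tok in perm[n_green:n_green + n_yellow]:
        labels[tok] = YELLOW
    return labels


def sample_watermarked(probs, prev_token, rng, key=42, ratios=RATIOS):
    """Sample the next token with all Red-set probability mass removed."""
    labels = partition(prev_token, len(probs), key, ratios)
    weights = [0.0 if labels[t] == RED else p for t, p in enumerate(probs)]
    # random.choices renormalizes the weights; it raises ValueError if the
    # Green and Yellow sets carry no probability mass at all.
    return rng.choices(range(len(probs)), weights=weights)[0]


def normal_sf(z):
    """One-sided p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def detect(tokens, vocab_size, alpha=0.01, key=42, ratios=RATIOS):
    """Replay the partitions and combine two one-sided tests via Fisher."""
    n_green = n_red = 0
    for prev, tok in zip(tokens, tokens[1:]):
        label = partition(prev, vocab_size, key, ratios)[tok]
        n_green += label == GREEN
        n_red += label == RED
    n = len(tokens) - 1
    g, r = ratios[0], ratios[2]
    # Under the null (no watermark) each count is Binomial(n, ratio).
    z_green = (n_green - g * n) / math.sqrt(n * g * (1 - g))  # enrichment
    z_red = (r * n - n_red) / math.sqrt(n * r * (1 - r))      # depletion
    p_green = max(normal_sf(z_green), 1e-300)  # clamp to avoid log(0)
    p_red = max(normal_sf(z_red), 1e-300)
    fisher = -2.0 * (math.log(p_green) + math.log(p_red))
    # Chi-square survival function with 4 degrees of freedom (closed form).
    p_combined = math.exp(-fisher / 2) * (1 + fisher / 2)
    return {"n_green": n_green, "n_red": n_red,
            "p_value": p_combined, "watermarked": p_combined < alpha}


if __name__ == "__main__":
    vocab_size, rng = 1000, random.Random(0)
    uniform = [1.0 / vocab_size] * vocab_size  # toy stand-in for an LLM
    tokens = [0]  # hypothetical prompt token
    for _ in range(200):
        tokens.append(sample_watermarked(uniform, tokens[-1], rng))
    print(detect(tokens, vocab_size))
```

In this sketch the watermarked sequence never contains a Red token, so the Red-depletion test alone produces a very small p-value, while the Green-enrichment test adds evidence because excluding Red raises the expected Green share from 0.5 to 0.5 / 0.75. The threshold `alpha` fixes the nominal false-positive rate only to the extent that the independence assumption behind Fisher's method holds.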
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
