HATS: High-Accuracy Triple-Set Watermarking for Large Language Models

Zhiqing Hu; Chenxu Zhao; Jiazhong Lu; Xiaolei Liu

arXiv:2512.19378 · cs.CL · December 23, 2025

Open Access

TL;DR

This paper introduces HATS, a watermarking method for large language models that partitions the vocabulary into three sets (Green/Yellow/Red) at each decoding step, enabling high-accuracy detection of watermarked text with minimal impact on readability.

Contribution

HATS presents a novel triple-partition vocabulary scheme for watermarking LLM outputs, improving detection accuracy and maintaining text quality compared to prior methods.

Findings

01. High detection accuracy at fixed false positive rates

02. Preserves readability of generated text

03. Effective on Llama 2 7B model

Abstract

Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher's method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.
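The generation and detection steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how partitions are seeded or which ratios are used, so the sketch assumes a partition seeded by a secret key and the previous token, and ratios of 0.5/0.3/0.2 for Green/Yellow/Red. The names `partition`, `sample_watermarked`, and `detect` are hypothetical. The null model treats each scored token as an independent draw, and the Fisher step treats the two p-values as independent, which is only an approximation because both counts come from the same tokens.

```python
import hashlib
import math
import random

# Assumed ratios (not given in the abstract); Red gets the remainder.
GREEN_RATIO, YELLOW_RATIO = 0.5, 0.3
RED_RATIO = 1.0 - GREEN_RATIO - YELLOW_RATIO


def partition(prev_token, vocab_size, key="demo-key"):
    """Split token ids into (green, yellow, red) sets.

    Assumption: the split is seeded by a secret key and the previous token,
    so the detector can replay it without access to the model.
    """
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    n_green = round(GREEN_RATIO * vocab_size)
    n_yellow = round(YELLOW_RATIO * vocab_size)
    return (set(ids[:n_green]),
            set(ids[n_green:n_green + n_yellow]),
            set(ids[n_green + n_yellow:]))


def sample_watermarked(probs, prev_token, rng):
    """Sample the next token with Red tokens excluded.

    `probs` is the model's next-token distribution as a list of floats;
    at least one Green or Yellow token must have nonzero probability.
    """
    green, yellow, _ = partition(prev_token, len(probs))
    allowed = sorted(green | yellow)
    weights = [probs[t] for t in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


def upper_tail(z):
    """One-sided p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def detect(tokens, vocab_size):
    """Return (z_green, z_red, combined_p) for a token sequence."""
    n = len(tokens) - 1  # the first token has no predecessor to seed from
    if n < 1:
        raise ValueError("need at least two tokens")
    n_green = n_red = 0
    for prev, tok in zip(tokens, tokens[1:]):
        green, _, red = partition(prev, vocab_size)
        n_green += tok in green
        n_red += tok in red
    # Green enrichment: more Green tokens than chance predicts.
    z_green = (n_green - GREEN_RATIO * n) / math.sqrt(
        n * GREEN_RATIO * (1 - GREEN_RATIO))
    # Red depletion: fewer Red tokens than chance predicts.
    z_red = (RED_RATIO * n - n_red) / math.sqrt(
        n * RED_RATIO * (1 - RED_RATIO))
    # Clamp to avoid log(0) when a p-value underflows.
    p_green = max(upper_tail(z_green), 1e-300)
    p_red = max(upper_tail(z_red), 1e-300)
    # Fisher's method: -2 * sum(ln p) is chi-square with 4 degrees of
    # freedom under the null, whose upper tail is exp(-x/2) * (1 + x/2).
    x = -2.0 * (math.log(p_green) + math.log(p_red))
    combined_p = math.exp(-x / 2) * (1 + x / 2)
    return z_green, z_red, combined_p


if __name__ == "__main__":
    vocab_size = 1000
    rng = random.Random(0)
    uniform = [1.0 / vocab_size] * vocab_size  # stand-in for model output
    tokens = [0]
    for _ in range(200):
        tokens.append(sample_watermarked(uniform, tokens[-1], rng))
    print(detect(tokens, vocab_size))
```

A passage would be flagged as watermarked when `combined_p` falls below a threshold chosen to meet the target false-positive rate. Because Red tokens are never sampled in this sketch, the Red-depletion z-score grows as the square root of the passage length even when the model's own preferences leave little room for Green enrichment.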

Taxonomy

Topics: Advanced Steganography and Watermarking Techniques · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection