Optimal Detection for Language Watermarks with Pseudorandom Collision

T. Tony Cai; Xiang Li; Qi Long; Weijie J. Su; Garrett G. Wen

arXiv:2510.22007·math.ST·January 21, 2026

TL;DR

This paper develops a statistical framework for detecting language watermarks in text generated by large language models, accounting for dependencies caused by repetition, and provides optimal detection rules with proven error control.

Contribution

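It introduces a hierarchical two-layer partition built on minimal units (the smallest groups that are independent across units but may be dependent within) and derives closed-form optimal detection rules for watermarks under realistic dependence conditions, strengthening the theoretical foundation of watermark detection.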

Findings

01 Improved detection power with rigorous Type I error control.

02 Repetition-induced dependence affects watermark detection performance.

03 Optimal detection rules derived for Gumbel-max and inverse-transform watermarks.

Abstract

Text watermarking plays a crucial role in ensuring the traceability and accountability of large language model (LLM) outputs and mitigating misuse. While promising, most existing methods assume perfect pseudorandomness. In practice, repetition in generated text induces collisions that create structured dependence, compromising Type I error control and invalidating standard analyses. We introduce a statistical framework that captures this structure through a hierarchical two-layer partition. At its core is the concept of minimal units -- the smallest groups treatable as independent across units while permitting dependence within. Using minimal units, we define a non-asymptotic efficiency measure and cast watermark detection as a minimax hypothesis testing problem. Applied to Gumbel-max and inverse-transform watermarks, our framework produces closed-form optimal rules. It explains why…
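
To make the collision issue concrete, here is a minimal Python sketch of how repeated text can lead to reused pseudorandom seeds. It rests on assumptions that are not taken from the paper: the seed is a SHA-256 hash of a secret key and the preceding `window` tokens, and the names `context_seed`, `collision_groups`, and `demo-key` are hypothetical. Grouping positions by identical seed is only a simplified stand-in for the dependence structure the abstract describes, not the paper's definition of minimal units.

```python
import hashlib
from collections import defaultdict


def context_seed(key: str, context: tuple) -> int:
    """Derive a pseudorandom seed from a secret key and the preceding tokens.

    Assumed scheme for illustration only: hash the key and the context window.
    """
    payload = key + "|" + ",".join(str(tok) for tok in context)
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def collision_groups(tokens, window: int, key: str = "demo-key"):
    """Group token positions whose preceding `window` tokens give the same seed."""
    groups = defaultdict(list)
    for t in range(window, len(tokens)):
        context = tuple(tokens[t - window:t])
        groups[context_seed(key, context)].append(t)
    return list(groups.values())


# A token sequence in which the phrase (5, 7, 9) is repeated.
tokens = [5, 7, 9, 5, 7, 9, 2]
groups = collision_groups(tokens, window=2)

# Positions in the same group reuse the same pseudorandom seed, so the
# statistics computed at those positions cannot be treated as independent.
for positions in groups:
    print(positions)
```

In this toy sequence, five scored positions share only three distinct seeds, so a detector that treats all five as independent would count the same pseudorandom draws more than once.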
