PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs

Jiaqi Xue; Yifei Zhao; Mansour Al Ghanim; Shangqian Gao; Ruimin Sun; Qian Lou; Mengxin Zheng

arXiv:2510.23891·cs.CR·October 29, 2025

TL;DR

PRO introduces a novel watermarking technique for open-source LLMs that ensures precise embedding and robustness against modifications, enabling effective verification of text origin without degrading model performance.

Contribution

The paper presents PRO, a joint training method that embeds robust watermarks into open-source LLMs by optimizing detectability and resilience to downstream modifications.

Findings

01

Significantly improves watermark detectability in open-source LLMs.

02

Enhances robustness of watermarks against fine-tuning and model merging.

03

Demonstrates effectiveness on models like LLaMA-3.2, LLaMA-3, and Phi-2.

Abstract

Text watermarking for large language models (LLMs) enables model owners to verify text origin and protect intellectual property. While watermarking methods for closed-source LLMs are relatively mature, extending them to open-source models remains challenging, as developers cannot control the decoding process. Consequently, owners of open-source LLMs lack practical means to verify whether text was generated by their models. A core difficulty lies in embedding watermarks directly into model weights without hurting detectability. A promising idea is to distill watermarks from a closed-source model into an open one, but this suffers from (i) poor detectability due to mismatch between learned and predefined patterns, and (ii) fragility to downstream modifications such as fine-tuning or model merging. To overcome these limitations, we propose PRO, a Precise and Robust text watermarking method…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Effective and robust watermarking for open-weight LLMs is an important open problem. As open-weight LLMs become more capable and widely used, combating LLM misuse via methods such as watermarking becomes more important. 2. The proposed method seems like a natural way to approach the problem. It optimizes the watermark policy to increase detectability while also optimizing against the loss of detectability caused by a simulated gradient update step on red tokens. 3. The code i

Weaknesses

1. Watermark detectability still drops significantly in PRO after fine-tuning. The TPR@5 decreases from 0.99 to 0.37 after 1500 fine-tuning steps on OpenMath Instruct, which I’m not sure I would call “robust”. 2. The numbers reported for the Gloaguen et al. (2025) method in Table 1 do not match up with the numbers they reported, even though the experimental setups seem to be mostly the same. [Gloaguen et al. (2025)](https://arxiv.org/abs/2502.10525) (Table 1) reports 0.69 TPR@5 after 2,500 fi
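For readers unfamiliar with the metric in this review, the following is a minimal sketch of how a score such as TPR@5 can be computed, assuming it denotes the true positive rate at a 5% false positive rate. The function name `tpr_at_fpr` and the toy scores are hypothetical and are not taken from the paper or its code; the detector scores could be any statistic where higher means "more likely watermarked".

```python
import math

def tpr_at_fpr(neg_scores, pos_scores, fpr=0.05):
    """True positive rate at a fixed false positive rate (illustrative helper).

    neg_scores: detector scores on unwatermarked text.
    pos_scores: detector scores on watermarked text.
    """
    # Sort unwatermarked scores from highest to lowest.
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of unwatermarked samples we may wrongly flag at this FPR.
    allowed = math.floor(fpr * len(neg_sorted) + 1e-9)
    # Flag a text when its score is strictly above this threshold, so that
    # at most `allowed` unwatermarked samples are flagged.
    threshold = neg_sorted[allowed] if allowed < len(neg_sorted) else -math.inf
    flagged = sum(1 for s in pos_scores if s > threshold)
    return flagged / len(pos_scores)

# Toy data: 100 unwatermarked scores and 4 watermarked scores.
unwatermarked = list(range(100))
watermarked = [90, 95, 100, 200]
print(tpr_at_fpr(unwatermarked, watermarked))
```

Under this reading, the drop from 0.99 to 0.37 cited above would mean that, with the threshold calibrated on unwatermarked text, the share of watermarked samples still detected falls from 99% to 37%.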

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Identifies the problem of Generation-Detection Inconsistency: the mappings of watermarked tokens are arbitrary. 2. Provides a novel method that co-adapts the watermark model with the real model to better align the watermark with the model's innate performance, and devises the FPL module to address the weakness of current open-source model watermarks under fine-tuning. 3. Carries out experiments validating the performance of PRO.

Weaknesses

1. Using model merging as an attack to evaluate learning-based watermarking may be inappropriate, since such attacks assume access to an unwatermarked model. In my opinion, model merging shouldn't be considered a valid attack. 2. Because a key component of CAWP relies on an MLP that extracts semantic information through a BERT encoder, it would be important to include comparisons with prior semantic-invariant distillation methods to demonstrate the necessity and contribution of the co-training

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper addresses a practical and growing problem: watermarking open-source LLMs where owners lack control over decoding. - The experiments across multiple open-source models demonstrate that PRO yields higher watermark detectability and improved resistance to post-training modification compared to baseline methods. - The paper is clearly structured, with intuitive figures.

Weaknesses

- The paper does not provide a formal analysis or theoretical guarantee on why the joint optimization leads to higher detectability or robustness. A more rigorous treatment (e.g., gradient alignment or mutual information perspective) would strengthen the claims. - While PRO aims for “precise and robust” watermarking, the authors do not systematically evaluate how the approach affects the model’s general performance (e.g., perplexity, generation quality, reasoning accuracy) on more diverse and br
