TL;DR
PRO introduces a novel watermarking technique for open-source LLMs that ensures precise embedding and robustness against modifications, enabling effective verification of text origin without degrading model performance.
Contribution
The paper presents PRO, a joint training method that embeds robust watermarks into open-source LLMs by optimizing detectability and resilience to downstream modifications.
Findings
Significantly improves watermark detectability in open-source LLMs.
Enhances robustness of watermarks against fine-tuning and model merging.
Demonstrates effectiveness on models like LLaMA-3.2, LLaMA-3, and Phi-2.
Abstract
Text watermarking for large language models (LLMs) enables model owners to verify text origin and protect intellectual property. While watermarking methods for closed-source LLMs are relatively mature, extending them to open-source models remains challenging, as developers cannot control the decoding process. Consequently, owners of open-source LLMs lack practical means to verify whether text was generated by their models. A core difficulty lies in embedding watermarks directly into model weights without hurting detectability. A promising idea is to distill watermarks from a closed-source model into an open one, but this suffers from (i) poor detectability due to mismatch between learned and predefined patterns, and (ii) fragility to downstream modifications such as fine-tuning or model merging. To overcome these limitations, we propose PRO, a Precise and Robust text watermarking method…
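To make "detectability" concrete, here is a minimal sketch of how a generic red/green-list text watermark is commonly detected with a z-score. It is not PRO's detector: the keyed hash, the green fraction `gamma`, the key string, and the helper names `is_green`, `z_score`, and `make_green_sequence` are all assumptions introduced for illustration.

```python
import hashlib
import math


def is_green(prev_token: int, token: int, gamma: float = 0.5,
             key: str = "demo-key") -> bool:
    """Hypothetical keyed rule: is `token` on the green list seeded by `prev_token`?"""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    # Map the first 8 bytes of the hash to a pseudo-random number in [0, 1].
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def z_score(tokens, gamma: float = 0.5, key: str = "demo-key") -> float:
    """Standardized excess of green tokens over the count expected by chance."""
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, gamma, key) for p, t in pairs)
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def make_green_sequence(length: int, vocab: int = 1000, gamma: float = 0.5,
                        key: str = "demo-key"):
    """Toy 'watermarked generator' (assumption): always emit a green token."""
    seq = [0]
    while len(seq) < length:
        seq.append(next(t for t in range(vocab)
                        if is_green(seq[-1], t, gamma, key)))
    return seq


if __name__ == "__main__":
    watermarked = make_green_sequence(101)
    print(z_score(watermarked))  # large z-score: every scored token is green
```

Under this kind of rule, unwatermarked text yields a z-score near zero, while text from a model that favors green tokens yields a large positive one. A distilled model that learns a slightly different pattern than the predefined rule would score lower, which is the mismatch the abstract describes.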
Peer Reviews
Decision·Submitted to ICLR 2026
1. Effective and robust watermarking for open-weight LLMs is an important open problem. As open-weight LLMs become more capable and widely used, combating LLM misuse via methods such as watermarking becomes more important.
2. The proposed method seems like a natural way to approach the problem. It simultaneously optimizes the watermark policy to increase detectability, along with optimizing against degradation in detectability from a simulated gradient update step on red tokens.
3. The code i
1. Watermark detectability still drops significantly in PRO after fine-tuning. The TPR@5 decreases from 0.99 to 0.37 after 1500 fine-tuning steps on OpenMath Instruct, which I’m not sure I would call “robust”.
2. The numbers reported for the Gloaguen et al. (2025) method in Table 1 do not match up with the numbers they reported, even though the experimental setups seem to be mostly the same. [Gloaguen et al. (2025)](https://arxiv.org/abs/2502.10525) (Table 1) reports 0.69 TPR@5 after 2,500 fi
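For readers unfamiliar with the metric, TPR@5 is read here as the true-positive rate at a 5% false-positive rate (an assumption, since the excerpt does not define it). The sketch below shows one way to compute such a number from detector scores, assuming higher scores mean "more likely watermarked". The function name `tpr_at_fpr` and the sample scores are hypothetical and are not taken from the paper or the review.

```python
def tpr_at_fpr(neg_scores, pos_scores, fpr: float = 0.05) -> float:
    """Fraction of watermarked texts flagged when at most `fpr` of
    unwatermarked texts are allowed to be flagged."""
    if not neg_scores or not pos_scores:
        raise ValueError("both score lists must be non-empty")
    neg = sorted(neg_scores, reverse=True)
    # Number of unwatermarked samples we may flag (epsilon guards float rounding).
    k = int(fpr * len(neg) + 1e-9)
    # Flag scores strictly above the (k+1)-th largest negative score,
    # so that at most k negatives are flagged even when scores tie.
    threshold = neg[k] if k < len(neg) else float("-inf")
    flagged = sum(s > threshold for s in pos_scores)
    return flagged / len(pos_scores)


if __name__ == "__main__":
    unwatermarked = list(range(100))      # toy detector scores for human text
    watermarked = [90, 95, 96, 100, 50]   # toy detector scores for model text
    print(tpr_at_fpr(unwatermarked, watermarked))
```

A drop from 0.99 to 0.37 under this reading means that, at the same 5% false-alarm budget, the detector goes from catching nearly all watermarked samples to catching roughly a third of them.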
1. Identifies the problem of Generation-Detection Inconsistency. The mappings of watermarked tokens are arbitrary.
2. Provides a novel method that co-adapts the watermark model with the real model to better align the watermark with the model's innate performance. It also devises the FPL module to address the weakness of current open-source model watermarks to fine-tuning.
3. Carries out experiments validating the performance of PRO.
1. Using model merging as an attack to evaluate learning-based watermarking may be inappropriate, since such attacks assume access to an unwatermarked model. In my opinion, model merging shouldn't be considered a valid attack.
2. Because a key component of CAWP relies on an MLP that extracts semantic information through a BERT encoder, it would be important to include comparisons with prior semantic-invariant distillation methods to demonstrate the necessity and contribution of the co-training
- The paper addresses a practical and growing problem: watermarking open-source LLMs where owners lack control over decoding.
- The experiments across multiple open-source models demonstrate that PRO yields higher watermark detectability and improved resistance to post-training modification compared to baseline methods.
- The paper is clearly structured, with intuitive figures.
- The paper does not provide a formal analysis or theoretical guarantee on why the joint optimization leads to higher detectability or robustness. A more rigorous treatment (e.g., gradient alignment or mutual information perspective) would strengthen the claims.
- While PRO aims for “precise and robust” watermarking, the authors do not systematically evaluate how the approach affects the model’s general performance (e.g., perplexity, generation quality, reasoning accuracy) on more diverse and br
