Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

Leyi Pan; Aiwei Liu; Shiyu Huang; Yijian Lu; Xuming Hu; Lijie Wen; Irwin King; Philip S. Yu

arXiv:2502.11598·cs.CL·May 27, 2025

TL;DR

This paper investigates whether LLM watermarks inherited by student models survive deliberate removal attempts. It finds that the proposed removal techniques can neutralize these watermarks while preserving knowledge transfer, which points to the need for more resilient watermarking strategies.

Contribution

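The study introduces and evaluates three watermark removal methods: untargeted and targeted training data paraphrasing (UP and TP) applied before distillation, and inference-time watermark neutralization (WN) applied after distillation. It demonstrates their effectiveness and efficiency and underscores the need for more robust watermarking in LLMs.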

Findings

01. Watermark removal methods can fully eliminate inherited watermarks.

02. Post-distillation watermark neutralization maintains knowledge transfer.

03. Watermark removal approaches are computationally efficient.

Abstract

The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate…
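The detection step that radioactivity relies on can be illustrated with a toy example. The sketch below assumes a green-list style watermark, one widely used family of schemes; the abstract does not name the schemes the paper evaluates, so this is not necessarily one of them. A keyed hash of the previous token marks a fraction of the vocabulary as "green", and detection is a z-test on the number of green tokens. All names here (`is_green`, `green_z_score`, `sample_watermarked`, `GAMMA`, the key string) are hypothetical and are not taken from the paper or its repository.

```python
import hashlib
import math
import random
from typing import List, Sequence

GAMMA = 0.5  # assumed fraction of the vocabulary marked "green" (illustrative value)


def is_green(prev_token: int, token: int, key: str = "demo-key") -> bool:
    """Assign `token` to the green list pseudo-randomly, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def green_z_score(tokens: Sequence[int], key: str = "demo-key") -> float:
    """z-score of the green-token count under the null hypothesis of no watermark."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed the hash
    if scored < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


def sample_watermarked(length: int, vocab_size: int = 1000,
                       key: str = "demo-key", seed: int = 0) -> List[int]:
    """Toy generator that only ever emits green tokens (a maximally strong watermark)."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size)]
    while len(tokens) < length:
        candidate = rng.randrange(vocab_size)
        if is_green(tokens[-1], candidate, key):  # reject red tokens
            tokens.append(candidate)
    return tokens


if __name__ == "__main__":
    rng = random.Random(1)
    plain = [rng.randrange(1000) for _ in range(200)]
    print("watermarked z:", green_z_score(sample_watermarked(200)))
    print("unwatermarked z:", green_z_score(plain))
```

In this toy setting, text built only from green tokens scores z = sqrt(T) for T scored tokens, while unwatermarked text scores near zero. Under these assumptions, a removal method succeeds when a student model's outputs score like the unwatermarked case.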

Code & Models

Repositories

thu-bpm/watermark-radioactivity-attack (Official)

Videos

Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Library Science and Information Systems · Data Quality and Management

Methods: Knowledge Distillation