Class-feature Watermark: A Resilient Black-box Watermark Against Model Extraction Attacks

Yaxin Xiao; Qingqing Ye; Zi Liang; Haoyang Li; RongHua Li; Huadi Zheng; Haibo Hu

arXiv:2511.07947·cs.CR·November 18, 2025

TL;DR

This paper introduces Class-Feature Watermarks (CFW), a black-box watermarking method designed to survive both model extraction and watermark removal attacks, so that ownership can still be verified without degrading model utility.

Contribution

The paper proposes CFW, a watermarking approach that leverages class-level artifacts and out-of-domain samples to improve robustness against model extraction and removal attacks.

Findings

1. CFW maintains a watermark success rate of at least 70.15% under combined attacks.

2. WRK (Watermark Removal attacK), the removal attack proposed in the paper, reduces watermark success rates by at least 88.79% across existing watermarking benchmarks.

3. CFW outperforms prior methods in resilience while preserving model utility.

Abstract

Machine learning models constitute valuable intellectual property, yet remain vulnerable to model extraction attacks (MEAs), where adversaries replicate their functionality through black-box queries. Model watermarking counters MEAs by embedding forensic markers for ownership verification. Current black-box watermarks prioritize MEA survival through representation entanglement, yet inadequately explore resilience against sequential MEAs and removal attacks. Our study reveals that this risk is underestimated because existing removal methods are weakened by entanglement. To address this gap, we propose Watermark Removal attacK (WRK), which circumvents entanglement constraints by exploiting decision boundaries shaped by prevailing sample-level watermark artifacts. WRK effectively reduces watermark success rates by at least 88.79% across existing watermarking benchmarks. For robust…
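
The abstract refers to black-box ownership verification and to a watermark success rate without spelling out how either is computed. The Python sketch below shows one common reading, which is an assumption here rather than the paper's procedure: the owner queries a suspect model on secret watermark inputs and measures the fraction classified as the watermark's target label. The names `watermark_success_rate` and `verify_ownership`, the toy models, the key set, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Sequence


def watermark_success_rate(
    predict: Callable[[int], int],
    watermark_inputs: Sequence[int],
    target_label: int,
) -> float:
    """Fraction of watermark inputs that the queried model maps to target_label.

    Only the model's predicted labels are used, which is what makes the
    check black-box.
    """
    if not watermark_inputs:
        raise ValueError("watermark_inputs must not be empty")
    hits = sum(1 for x in watermark_inputs if predict(x) == target_label)
    return hits / len(watermark_inputs)


def verify_ownership(
    predict: Callable[[int], int],
    watermark_inputs: Sequence[int],
    target_label: int,
    threshold: float = 0.5,  # hypothetical decision threshold, not from the paper
) -> bool:
    """Claim ownership when the success rate reaches the threshold."""
    return watermark_success_rate(predict, watermark_inputs, target_label) >= threshold


# Toy stand-ins for real classifiers (assumptions for illustration only).
TARGET_LABEL = 7
WATERMARK_INPUTS = list(range(100, 120))  # secret key samples held by the owner


def extracted_model(x: int) -> int:
    # Imitates a stolen copy that kept most of the watermark behaviour.
    return TARGET_LABEL if x % 10 != 0 else 0


def independent_model(x: int) -> int:
    # Imitates an unrelated model that hits the target label only by chance.
    return x % 10


extracted_rate = watermark_success_rate(extracted_model, WATERMARK_INPUTS, TARGET_LABEL)
independent_rate = watermark_success_rate(independent_model, WATERMARK_INPUTS, TARGET_LABEL)
```

In this toy setup the extracted copy answers most key samples with the target label and passes verification, while the independent model does not. A removal attack such as WRK would aim to push the first rate down toward the second.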

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Physical Unclonable Functions (PUFs) and Hardware Security