Distillation-Resistant Watermarking for Model Protection in NLP

Xuandong Zhao; Lei Li; Yu-Xiang Wang

arXiv:2210.03312 · cs.CL · October 25, 2022

TL;DR

This paper introduces Distillation-Resistant Watermarking (DRW), a novel method to protect NLP models from theft via distillation by embedding detectable watermarks into model predictions without significantly affecting accuracy.
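The idea of embedding a detectable watermark into model predictions can be illustrated with a minimal Python sketch. It rests on an assumption that this page does not state: the watermark is taken to be a cosine signal whose phase comes from projecting an input's features onto a secret unit vector. The key parts (`key_vec`, the frequency `f_w`, the target class) and all function names are hypothetical. The perturbed output remains a valid probability distribution.

```python
import math
import random
from typing import List, Sequence


def key_phase(features: Sequence[float], key_vec: Sequence[float]) -> float:
    """Hypothetical keyed hash g(x): project the normalized input features onto a
    secret unit vector and rescale the result from [-1, 1] to [0, 1]."""
    norm = math.sqrt(sum(f * f for f in features)) or 1.0
    s = sum(f * k for f, k in zip(features, key_vec)) / norm
    return 0.5 * (s + 1.0)


def watermark_probs(probs: Sequence[float], target: int, phase: float,
                    f_w: float, eps: float) -> List[float]:
    """Add a cosine signal with secret frequency f_w to the target class and the
    opposite-phase signal, split evenly, to the other classes; then renormalize."""
    m = len(probs)
    signal = 1.0 + math.cos(f_w * phase)  # in [0, 2]
    anti = 2.0 - signal                   # equals 1 + cos(f_w * phase + pi)
    out = []
    for j, p in enumerate(probs):
        bump = eps * signal if j == target else eps * anti / (m - 1)
        out.append((p + bump) / (1.0 + 2.0 * eps))  # total added mass is 2 * eps
    return out


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a random direction to serve as the (hypothetical) secret projection."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


rng = random.Random(0)
key_vec = random_unit_vector(8, rng)                 # secret projection (assumed)
features = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for a text embedding
phase = key_phase(features, key_vec)
wm = watermark_probs([0.7, 0.2, 0.1], target=1, phase=phase, f_w=30.0, eps=0.2)
```

In this sketch, every non-target class receives the same bump, so the relative order of those classes never changes. The top-1 label can change only when the margin between the top class and a competitor is below 2ε. This gives one intuition for how a watermark of this form can keep accuracy close to the original model's accuracy.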

Contribution

The paper presents DRW, the first watermarking technique specifically designed for NLP models that resists model distillation and effectively detects stolen models.

Findings

01. DRW achieves 100% detection precision across multiple NLP tasks.

02. DRW retains the original model's accuracy within a provable bound.

03. Prior methods fail to detect theft in some tasks.

Abstract

How can we protect the intellectual property of trained NLP models? Modern NLP models are prone to being stolen by querying their publicly exposed APIs and distilling from the responses. However, existing protection methods such as watermarking work only for images and do not apply to text. We propose Distillation-Resistant Watermarking (DRW), a novel technique to protect NLP models from being stolen via distillation. DRW protects a model by injecting watermarks, tied to a secret key, into the victim's prediction probabilities, and it can detect that key by probing a suspect model. We prove that a protected model still retains the original accuracy within a certain bound. We evaluate DRW on a diverse set of NLP tasks including text classification, part-of-speech tagging, and named entity recognition. Experiments show that DRW protects the original model and detects stealing suspects at…
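Detecting the key by probing a suspect model can also be sketched, under the assumption (not confirmed on this page) that the watermark is a cosine signal in the target-class probability. Under that assumption, a distilled student that imitates the victim should inherit a spectral peak at the secret frequency. The peak appears when the student's target-class probabilities are viewed as a function of each probe input's keyed phase. The phases are unevenly spaced, so the sketch uses a Lomb-Scargle periodogram and compares the power at `f_w` with the average power over an off-key band. The suspect outputs are simulated. The names `simulate_suspect` and `watermark_snr`, as well as the frequencies, noise level, and threshold, are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def lomb_scargle_power(t: Sequence[float], y: Sequence[float], omega: float) -> float:
    """Lomb-Scargle power of samples y(t) at angular frequency omega; suited to
    unevenly spaced sample points such as keyed probe phases."""
    mean_y = sum(y) / len(y)
    yc = [v - mean_y for v in y]
    # The offset tau makes the sine and cosine terms orthogonal.
    s2 = sum(math.sin(2 * omega * ti) for ti in t)
    c2 = sum(math.cos(2 * omega * ti) for ti in t)
    tau = math.atan2(s2, c2) / (2 * omega)
    cos_t = [math.cos(omega * (ti - tau)) for ti in t]
    sin_t = [math.sin(omega * (ti - tau)) for ti in t]
    yc_cos = sum(a * b for a, b in zip(yc, cos_t))
    yc_sin = sum(a * b for a, b in zip(yc, sin_t))
    cc = sum(c * c for c in cos_t)
    ss = sum(s * s for s in sin_t)
    return 0.5 * (yc_cos ** 2 / cc + yc_sin ** 2 / ss)


def watermark_snr(phases: Sequence[float], target_probs: Sequence[float],
                  f_w: float, off_band: Sequence[float]) -> float:
    """Ratio of the power at the secret frequency to the mean power off-key."""
    p_key = lomb_scargle_power(phases, target_probs, f_w)
    p_off = [lomb_scargle_power(phases, target_probs, w) for w in off_band]
    return p_key / (sum(p_off) / len(p_off))


def simulate_suspect(n: int, f_w: float, eps: float, watermarked: bool,
                     seed: int) -> Tuple[List[float], List[float]]:
    """Hypothetical suspect: target-class probabilities on n probe inputs, with
    or without an inherited cosine watermark, plus Gaussian noise."""
    rng = random.Random(seed)
    phases, probs = [], []
    for _ in range(n):
        phi = rng.random()            # stand-in for the keyed phase of a probe input
        p = rng.uniform(0.2, 0.6)     # target-class probability without a watermark
        if watermarked:
            p = (p + eps * (1 + math.cos(f_w * phi))) / (1 + 2 * eps)
        phases.append(phi)
        probs.append(p + rng.gauss(0.0, 0.02))  # imperfect imitation by the student
    return phases, probs


F_W = 30.0                                      # secret frequency (assumed)
OFF_BAND = [60.0 + 2.0 * k for k in range(30)]  # frequencies well away from F_W
THRESHOLD = 20.0                                # illustrative decision threshold

phases, probs = simulate_suspect(500, F_W, 0.2, watermarked=True, seed=1)
wm_snr = watermark_snr(phases, probs, F_W, OFF_BAND)
phases, probs = simulate_suspect(500, F_W, 0.2, watermarked=False, seed=2)
clean_snr = watermark_snr(phases, probs, F_W, OFF_BAND)
```

A suspect would be flagged when its ratio exceeds the threshold. In practice, the threshold would need calibrating on models known to be independent of the victim rather than fixing it at the illustrative value used here.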

Code & Models

Repositories

xuandongzhao/drw
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques