Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models

Minhao Bai; Kaiyi Pang; Yongfeng Huang

arXiv:2405.01509·cs.CR·May 3, 2024

TL;DR

This paper introduces a learnable linguistic watermarking technique for large language models that embeds identifiable signals into output distributions, enabling effective tracing of model extraction attacks while maintaining output quality.

Contribution

The paper presents a statistically grounded watermarking method that subtly modifies the output distributions of LLMs so that extracted models can be traced, with the stated aim of improving robustness and detection accuracy.

Findings

01 The watermark effectively distinguishes original from modified outputs.

02 The method maintains low false-positive and false-negative rates.

03 The method preserves the original performance of the language model.

Abstract

In the rapidly evolving domain of artificial intelligence, safeguarding the intellectual property of Large Language Models (LLMs) is increasingly crucial. Current watermarking techniques against model extraction attacks, which rely on signal insertion in model logits or post-processing of generated text, remain largely heuristic. We propose a novel method for embedding learnable linguistic watermarks in LLMs, aimed at tracing and preventing model extraction attacks. Our approach subtly modifies the LLM's output distribution by introducing controlled noise into token frequency distributions, embedding a statistically identifiable, controllable watermark. We leverage statistical hypothesis testing and information theory, particularly focusing on Kullback-Leibler Divergence, to differentiate between original and modified distributions effectively. Our watermarking method strikes a delicate…
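
The general idea in the abstract (perturb a token distribution slightly, then tell the original and modified distributions apart with a statistical test) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' method: the uniform 50-token vocabulary, the key-seeded sign pattern, the `strength` parameter, and the helper names `perturb`, `kl_divergence`, and `log_likelihood_ratio` are all hypothetical choices made for illustration.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perturb(p, key, strength=0.2):
    """Tilt p with key-dependent multiplicative noise, then renormalize.

    Assumption: the noise is a secret +/-1 pattern drawn from a seeded RNG.
    """
    rng = random.Random(key)
    signs = [rng.choice((-1.0, 1.0)) for _ in p]
    weights = [pi * math.exp(strength * s) for pi, s in zip(p, signs)]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood_ratio(tokens, p, q):
    """Sum of log q(t) / p(t); positive values favor the perturbed distribution q."""
    return sum(math.log(q[t] / p[t]) for t in tokens)


# Toy setup: a uniform base distribution over a hypothetical 50-token vocabulary.
vocab_size = 50
base = [1.0 / vocab_size] * vocab_size
marked = perturb(base, key=1234)

# Draw one sample from each distribution with a fixed seed for repeatability.
sampler = random.Random(0)
marked_sample = sampler.choices(range(vocab_size), weights=marked, k=5000)
plain_sample = sampler.choices(range(vocab_size), weights=base, k=5000)

# The KL divergence measures how far the perturbed distribution moved.
kl = kl_divergence(marked, base)

# The likelihood-ratio statistic should be positive for perturbed text
# and negative for unperturbed text when the sample is large enough.
llr_marked = log_likelihood_ratio(marked_sample, base, marked)
llr_plain = log_likelihood_ratio(plain_sample, base, marked)

print(f"KL(marked || base) = {kl:.4f} nats")
print(f"LLR on marked sample = {llr_marked:.1f}")
print(f"LLR on plain sample = {llr_plain:.1f}")
```

The expected value of the statistic per token equals the KL divergence between the two distributions, so a smaller perturbation changes the output less but needs more tokens before the test separates the two cases reliably.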

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning