Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models
Minhao Bai, Kaiyi Pang, Yongfeng Huang

TL;DR
This paper introduces a learnable linguistic watermarking technique for large language models that embeds identifiable signals into output distributions, enabling effective tracing of model extraction attacks while maintaining output quality.
Contribution
It presents a novel, statistically grounded watermarking method that subtly modifies output distributions of LLMs for secure model tracing, improving robustness and detection accuracy.
Findings
The watermark effectively distinguishes original from modified outputs.
The method maintains low false positive and false negative rates.
It preserves the original performance of the language model.
Abstract
In the rapidly evolving domain of artificial intelligence, safeguarding the intellectual property of Large Language Models (LLMs) is increasingly crucial. Current watermarking techniques against model extraction attacks, which rely on signal insertion in model logits or post-processing of generated text, remain largely heuristic. We propose a novel method for embedding learnable linguistic watermarks in LLMs, aimed at tracing and preventing model extraction attacks. Our approach subtly modifies the LLM's output distribution by introducing controlled noise into token frequency distributions, embedding a statistically identifiable, controllable watermark. We leverage statistical hypothesis testing and information theory, particularly focusing on Kullback-Leibler Divergence, to differentiate between original and modified distributions effectively. Our watermarking method strikes a delicate…
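The general idea in the abstract, perturbing a token distribution and then testing which distribution an observed sample matches, can be illustrated with a toy example. The sketch below is not the authors' method. The key-seeded multiplicative noise, the function names (`watermark_distribution`, `kl_divergence`, `looks_watermarked`), the `strength` parameter, and the zero-threshold decision rule are all assumptions made for illustration, and a single fixed unigram distribution stands in for a real LLM's context-dependent output.

```python
import numpy as np

def watermark_distribution(p, key, strength=0.3):
    """Hypothetical watermark: scale each token probability by key-seeded noise.

    Assumes 0 <= strength < 1 so every probability stays positive.
    """
    rng = np.random.default_rng(key)            # the secret key seeds the noise
    noise = rng.uniform(-1.0, 1.0, size=p.shape)
    q = p * (1.0 + strength * noise)
    return q / q.sum()                          # renormalize to a distribution

def kl_divergence(a, b):
    """KL(a || b) in nats; assumes b > 0 wherever a > 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mask = a > 0                                # terms with a == 0 contribute 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def looks_watermarked(tokens, p, q):
    """Decide whether observed token frequencies are closer to q than to p.

    KL(freq || p) - KL(freq || q) equals the per-token log-likelihood ratio
    of q against p, so this is a likelihood-ratio test with threshold 0.
    """
    counts = np.bincount(tokens, minlength=len(p))
    freq = counts / counts.sum()
    return kl_divergence(freq, q) < kl_divergence(freq, p)

# Toy usage: a 50-token vocabulary with a random base distribution.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(50))                  # stand-in for the original model
q = watermark_distribution(p, key=1234)         # stand-in for the watermarked model
suspect = rng.choice(len(p), size=20_000, p=q)  # text sampled from the watermarked model
clean = rng.choice(len(p), size=20_000, p=p)    # text sampled from the original model
print(looks_watermarked(suspect, p, q), looks_watermarked(clean, p, q))
```

A practical detector would choose the threshold from target false positive and false negative rates rather than fixing it at zero. In this toy setup, a larger `strength` makes detection easier but moves the output distribution further from the original.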
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning
