A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors
Zhenyu Xu, Victor S. Sheng

TL;DR
This paper presents a behavioral fingerprinting method for large language models using refusal vectors, enabling robust provenance tracking and IP protection even after model modifications.
Contribution
It introduces a novel fingerprinting framework based on refusal vectors and validates its effectiveness for large-scale model identification and IP protection.
Findings
Achieves 100% accuracy in identifying model families across 76 models.
Fingerprint remains robust against finetuning, merging, and quantization.
Detects traces of alignment-breaking attacks, enabling security analysis.
Abstract
Protecting the intellectual property of large language models (LLMs) is a critical challenge due to the proliferation of unauthorized derivative models. We introduce a novel fingerprinting framework that leverages the behavioral patterns induced by safety alignment, applying the concept of refusal vectors for LLM provenance tracking. These vectors, extracted from directional patterns in a model's internal representations when processing harmful versus harmless prompts, serve as robust behavioral fingerprints. Our contribution lies in developing a fingerprinting system around this concept and conducting extensive validation of its effectiveness for IP protection. We demonstrate that these behavioral fingerprints are highly robust against common modifications, including finetunes, merges, and quantization. Our experiments show that the fingerprint is unique to each model family, with low…
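To make the idea of a refusal-vector fingerprint concrete, the following is a minimal sketch and not the authors' implementation. It assumes the refusal vector is estimated as the normalized difference between mean activations on harmful and harmless prompts (a common difference-of-means construction; the abstract does not specify the exact extraction procedure), and that fingerprints are compared by cosine similarity. The activations are synthetic stand-ins for hidden states, and all function names are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_vector(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless).

    Assumption: a difference-of-means estimator; the paper may use another.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def synthetic_activations(rng, direction, n=200, noise=0.5):
    """Fake hidden states: harmful prompts are shifted along `direction`."""
    dim = len(direction)
    harmless = [[rng.gauss(0.0, noise) for _ in range(dim)] for _ in range(n)]
    harmful = [[rng.gauss(0.0, noise) + d for d in direction] for _ in range(n)]
    return harmful, harmless


rng = random.Random(0)
dim = 64

# Hypothetical underlying refusal directions for three models:
# a base model, a lightly modified derivative, and an unrelated family.
base_dir = [rng.gauss(0.0, 1.0) for _ in range(dim)]
derived_dir = [b + rng.gauss(0.0, 0.1) for b in base_dir]
other_dir = [rng.gauss(0.0, 1.0) for _ in range(dim)]

fp_base = refusal_vector(*synthetic_activations(rng, base_dir))
fp_derived = refusal_vector(*synthetic_activations(rng, derived_dir))
fp_other = refusal_vector(*synthetic_activations(rng, other_dir))

sim_derived = cosine_similarity(fp_base, fp_derived)
sim_unrelated = cosine_similarity(fp_base, fp_other)
print(f"base vs derived:   {sim_derived:.3f}")
print(f"base vs unrelated: {sim_unrelated:.3f}")
```

In this toy setup the derivative's fingerprint stays close to the base model's, while the unrelated model's fingerprint is nearly orthogonal to it. With a real model, the synthetic activations would be replaced by hidden states collected at a chosen layer, and the decision threshold would need to be calibrated empirically.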
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Scientific Computing and Data Management · Explainable Artificial Intelligence (XAI)
