AttnDiff: Attention-based Differential Fingerprinting for Large Language Models
Haobo Zhang, Zhenhua Xu, Junxian Li, Shangfeng Sheng, Dezhang Kong, Meng Han

TL;DR
AttnDiff is a white-box fingerprinting method that verifies model provenance by analyzing differential attention patterns, effective across various model modifications and open-source LLM families.
Contribution
It introduces a data-efficient framework that captures intrinsic attention-based fingerprints for large language models, enabling provenance verification despite common laundering techniques.
Findings
High similarity scores (>0.98) for related derivatives across multiple models.
Effective separation of unrelated models with low similarity (<0.22).
Supports practical provenance verification with as few as 5 probes.
Abstract
Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model despite common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose AttnDiff, a data-efficient white-box framework that extracts fingerprints from models via intrinsic information-routing behavior. AttnDiff probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using CKA. Across Llama-2/3 and Qwen2.5 (3B–14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., vs. with probes). With 5–60 multi-domain probes, it…
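The comparison step described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: that the "spectral descriptor" is the top-k singular values of the attention difference for a prompt pair, and that the CKA variant is linear CKA. The function names `spectral_descriptor` and `linear_cka` and the parameter `k` are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def spectral_descriptor(attn_a, attn_b, k=4):
    """Summarize one prompt pair by the top-k singular values of the
    attention difference (assumed descriptor, not taken from the paper)."""
    diff = np.asarray(attn_a, dtype=float) - np.asarray(attn_b, dtype=float)
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(diff, compute_uv=False)
    return singular_values[:k]


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_probes, n_features).

    Rows must correspond to the same probes in both matrices; the feature
    dimensions may differ.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each feature column across probes.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def fingerprint(attn_pairs, k=4):
    """Stack one descriptor per probe into an (n_probes, k) matrix."""
    return np.stack([spectral_descriptor(a, b, k) for a, b in attn_pairs])
```

Under these assumptions, each model would be run on the same probe pairs, `fingerprint` would be applied to the resulting attention maps, and `linear_cka` would score the two fingerprint matrices, with values near 1 suggesting a shared lineage. Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is one reason it is a common choice for comparing representations across models.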
