CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of Large Language Models
Zhenzhen Ren, GuoBiao Li, Sheng Li, Zhenxing Qian, Xinpeng Zhang

TL;DR
This paper introduces CoTSRF, a novel method for fingerprinting large language models using Chain of Thought responses, which enhances stealthiness and robustness in identifying specific models behind suspect applications.
Contribution
The paper presents a new LLM fingerprinting scheme utilizing Chain of Thought responses and contrastive learning to improve stealth and robustness over existing methods.
Findings
CoTSRF effectively identifies source LLMs with high accuracy.
It demonstrates robustness against adversarial attempts to hide fingerprints.
The method is stealthier than prior fingerprinting techniques.
Abstract
Despite providing superior performance, open-source large language models (LLMs) are vulnerable to abusive usage. To address this issue, recent works propose LLM fingerprinting methods to identify the specific source LLMs behind suspect applications. However, these methods fail to provide stealthy and robust fingerprint verification. In this paper, we propose a novel LLM fingerprinting scheme, namely CoTSRF, which utilizes the Chain of Thought (CoT) as the fingerprint of an LLM. CoTSRF first collects the responses from the source LLM by querying it with crafted CoT queries. Then, it applies contrastive learning to train a CoT extractor that extracts the CoT feature (i.e., fingerprint) from the responses. Finally, CoTSRF conducts fingerprint verification by comparing the Kullback-Leibler divergence between the CoT features of the source and suspect LLMs against an empirical threshold.…
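The verification step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the CoT features are real-valued vectors that are turned into probability distributions with a softmax, that the divergence is taken from the source to the suspect, and that a divergence below the threshold indicates a match. The function names and the threshold value are hypothetical; the abstract only states that the Kullback-Leibler divergence is compared against an empirical threshold.

```python
import math


def softmax(xs):
    """Normalize a raw feature vector into a probability distribution."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def is_same_source(source_feature, suspect_feature, threshold):
    """Assumed decision rule: declare a match when the divergence between
    the source and suspect CoT features is below the threshold."""
    p = softmax(source_feature)
    q = softmax(suspect_feature)
    return kl_divergence(p, q) < threshold


# Hypothetical feature vectors and threshold, for illustration only.
source = [1.0, 2.0, 3.0]
close_suspect = [1.0, 2.0, 3.0]
far_suspect = [3.0, 2.0, 1.0]
print(is_same_source(source, close_suspect, threshold=0.01))
print(is_same_source(source, far_suspect, threshold=0.01))
```

In practice the threshold would be chosen empirically, as the abstract notes, for example from divergences measured between the source model and known unrelated models.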
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection
Methods: Contrastive Learning
