REEF: Representation Encoding Fingerprints for Large Language Models
Jie Zhang, Dongrui Liu, Chen Qian, Linfeng Zhang, Yong Liu, Yu Qiao, Jing Shao

TL;DR
REEF is a training-free method that compares feature representations of large language models to identify relationships and protect intellectual property without impairing model capabilities.
Contribution
It introduces REEF, a novel, training-free approach using centered kernel alignment to determine model relationships based on feature representations.
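For reference, the standard definition of centered kernel alignment from the representation-similarity literature is sketched below. It is not quoted from this page, and the symbols are assumptions introduced here: $X$ and $Y$ are the two models' representations of the same $n$ samples, $K$ and $L$ are their Gram matrices under some chosen kernels, and $H$ is the centering matrix.

```latex
\[
\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H),
\qquad
H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}
\]
\[
\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}}
\]
```

The score lies in $[0, 1]$ for positive semidefinite kernels. The normalization makes it invariant to isotropic rescaling of either representation.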
Findings
Effective in identifying model relationships
Robust to fine-tuning, pruning, and permutations
Does not impair model capabilities
Abstract
Protecting the intellectual property of open-source Large Language Models (LLMs) is very important, because training LLMs costs extensive computational resources and data. Therefore, model owners and third parties need to identify whether a suspect model is a subsequent development of the victim model. To this end, we propose a training-free REEF to identify the relationship between the suspect and victim models from the perspective of LLMs' feature representations. Specifically, REEF computes and compares the centered kernel alignment similarity between the representations of a suspect model and a victim model on the same samples. This training-free REEF does not impair the model's general capabilities and is robust to sequential fine-tuning, pruning, model merging, and permutations. In this way, REEF provides a simple and effective way for third parties and models' owners to protect…
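The comparison described in the abstract can be illustrated with a minimal sketch. It assumes a linear kernel (the abstract only says "centered kernel alignment") and uses synthetic random matrices as stand-ins for LLM activations. The function name `linear_cka` and all variable names are hypothetical and not taken from the paper's code.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear-kernel CKA between two (n_samples, dim) representation matrices.

    Both matrices must be computed on the same samples, row for row;
    the feature dimensions may differ.
    """
    # Center every feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Feature-space form of linear CKA:
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins (assumption): rows are samples, columns are hidden units.
rng = np.random.default_rng(0)
victim = rng.standard_normal((200, 64))

# A "suspect" whose hidden units were permuted to hide its origin.
permuted = victim[:, rng.permutation(64)]
# A "suspect" that is a lightly perturbed copy, loosely mimicking fine-tuning.
finetuned = victim + 0.1 * rng.standard_normal((200, 64))
# An independently drawn representation with a different width.
unrelated = rng.standard_normal((200, 48))

score_permuted = linear_cka(victim, permuted)    # equals 1 up to rounding error
score_finetuned = linear_cka(victim, finetuned)
score_unrelated = linear_cka(victim, unrelated)

print(f"permuted:  {score_permuted:.4f}")
print(f"finetuned: {score_finetuned:.4f}")
print(f"unrelated: {score_unrelated:.4f}")
```

Permuting hidden units leaves the score unchanged because the Frobenius norms in the formula are invariant to column permutations. This is one way to see why a CKA-based fingerprint could survive that kind of obfuscation. Real activations would come from running both models on the same inputs, and how those are extracted is not specified on this page.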
Peer Reviews
Decision: ICLR 2025 Oral
1. The proposed approach REEF
- Is training-free, simple and efficient
- Does not impair the model’s general capabilities
- Is robust to sequential fine-tuning, pruning, model merging, and permutations
- Is intuitive as feature representations of fine-tuned victim models are similar to feature representations of the original victim model, while the feature representations of unrelated models exhibit distinct distributions
2. Experimental evaluation
- considers an informed adversary who
I enjoyed reading this paper, and I am happy with the current version as it has already considered all important studies in its theoretical and empirical analyses. Below are a few suggestions which could improve the paper further: 1. This paper creates that adversary by designing a customized loss function that maximises the representational divergence between models. Is there any more effective way to create such an informed adversary? For example, designing a better loss function or perfor
This paper appears to be original, taking the known idea that models develop distinct feature representations, and introducing a novel use of the centered kernel alignment score to measure similarity between these representations robustly. The scheme does not require changing the training algorithm of the original model, which is a nice property. This paper is clearly written and well organized. The experiments are comprehensive in comparing the REEF scores of several known models, considering v
The biggest weakness is that REEF does not offer any provable false positive guarantee. Furthermore, this work does not measure the false positive rate, and indeed it is challenging to measure the false positive rate in a meaningful way since real-world LLMs are so expensive to train. Although this work does compare the REEF scores of several models, this sample size is not enough to extrapolate a false positive rate in general. For example, it is unclear how REEF would perform on two models tha
- Protection of the IP of the large language model is an urgent and important topic.
- The experiments are extensive to show the effectiveness of the proposed method.
- The authors clearly identify the challenges and limitations of existing fingerprint methods.
The novelty is limited to replacing the similarity measures of prior work. There is a lack of theoretical analysis of the proposed method. Additionally, I'm confused about the first heatmap of Figure 6. It shows that two LLMs trained on different datasets have a high CKA similarity, which would lead REEF to classify them as the same model. This means all models that use the same architecture but are trained on different datasets would share the same fingerprint. That doesn't make sense.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
