Dataset Ownership Verification in Contrastive Pre-trained Models
Yuechen Xie, Jie Song, Mengqi Xue, Haofei Zhang, Xingen Wang, Bingde Hu, Genlang Chen, Mingli Song

TL;DR
This paper introduces a dataset ownership verification method for self-supervised contrastive pre-trained models. It lets dataset owners check whether a model was trained on their dataset, a setting that earlier verification methods did not address.
Contribution
It is the first to tailor dataset ownership verification specifically for contrastive self-supervised models, leveraging differences in embedding space relationships.
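As a rough illustration of what "embedding space relationships" can mean here, the sketch below compares how similar a backbone's embeddings are for two augmented views of the same sample versus views of different samples. It is a minimal sketch under stated assumptions, not the paper's statistic: `cosine` and `relationship_stats` are hypothetical helper names, and the toy vectors stand in for the outputs of a black-box encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relationship_stats(views_a, views_b):
    """Return (mean same-sample similarity, mean cross-sample similarity).

    views_a[i] and views_b[i] are assumed to be embeddings of two
    augmentations of sample i, as returned by the suspect backbone.
    """
    n = len(views_a)
    same = [cosine(views_a[i], views_b[i]) for i in range(n)]
    cross = [cosine(views_a[i], views_b[j])
             for i in range(n) for j in range(n) if i != j]
    return sum(same) / len(same), sum(cross) / len(cross)

# Toy embeddings (hypothetical): two samples, two augmented views each.
toy_a = [[1.0, 0.0], [0.0, 1.0]]
toy_b = [[0.9, 0.1], [0.1, 0.9]]
same_mean, cross_mean = relationship_stats(toy_a, toy_b)
print(same_mean, cross_mean)
```

A verifier would compute such statistics on the owner's data and on data the model has certainly not seen, then compare the two sets of values.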
Findings
Effective verification across multiple contrastive models.
Significant statistical rejection of null hypothesis with low p-values.
Outperforms previous methods in accuracy and reliability.
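To make the "rejection of null hypothesis with low p-values" finding concrete, here is a minimal sketch of a one-sided test on per-sample scores. It assumes the verifier already has a score per sample (for example, a similarity statistic) on the owner's dataset and on reference data the model has not seen. This page does not say which test the paper uses, so a permutation test is swapped in for illustration; `permutation_p_value` and the toy scores are hypothetical.

```python
import random
from statistics import fmean

def permutation_p_value(candidate, reference, n_perm=10000, seed=0):
    """One-sided permutation test.

    Null hypothesis: candidate scores are not larger on average than
    reference scores (i.e. the model treats both datasets alike).
    """
    rng = random.Random(seed)
    observed = fmean(candidate) - fmean(reference)
    pooled = list(candidate) + list(reference)
    k = len(candidate)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = fmean(pooled[:k]) - fmean(pooled[k:])
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Toy scores (made up for illustration, not results from the paper).
owner_scores = [0.91, 0.93, 0.95, 0.92, 0.94, 0.96, 0.90, 0.93]
unseen_scores = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.82]
p = permutation_p_value(owner_scores, unseen_scores)
print("reject null" if p < 0.05 else "fail to reject null")
```

A small p-value means the score gap is unlikely under the null hypothesis, which the owner can read as evidence that the model was trained on the dataset. The 0.05 threshold is a conventional choice, not one taken from the paper.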
Abstract
High-quality open-source datasets, which necessitate substantial efforts for curation, have become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised pre-trained models by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insights that when models are…
Peer Reviews
Decision·ICLR 2025 Poster
- The paper presents a unique dataset ownership verification (DOV) method specifically tailored for self-supervised contrastive learning models. This is a valuable addition to the field, as existing DOV methods are generally focused on supervised or non-contrastive learning models, leaving a gap that this paper addresses.
- The authors conduct extensive experiments across multiple datasets (CIFAR10, CIFAR100, SVHN, ImageNet variants) and contrastive learning architectures (SimCLR, BYOL, MoCo, DIN
- While the proposed method offers a novel approach to dataset ownership verification, its applicability is limited to contrastive learning models. Many self-supervised learning models use objectives other than contrastive learning, so expanding the method’s scope could enhance its impact. However, this limitation is relatively minor.
- In line 488, the authors state that "the private training method does not affect our verification results," but this claim is based on experiments using only DP-
1. **Innovative Approach:** The method uniquely applies to self-supervised models by leveraging characteristics of contrastive learning, filling a gap in current DOV methods that primarily target supervised learning.
2. **Black-box Applicability:** The approach is suitable for black-box scenarios, which is practical and aligned with real-world applications where full model access is unavailable. The approach demonstrates robust performance across different datasets (e.g., CIFAR, ImageNet) and ar
1. **Dependency on Feature Representation Access:** The method requires access to feature representations, which might not be feasible in all practical scenarios, as many services limit this access for security reasons.
2. **Limited Application to Non-Contrastive Pre-Trained Models:** The method’s effectiveness is constrained to contrastive learning. Other prevalent pre-training strategies, such as masked image modeling (MIM), are not effectively addressed, potentially limiting applicability. H
This paper addresses an important and novel problem—dataset copyright protection in contrastive learning. The authors provide a comprehensive range of experiments, and the proposed method consistently demonstrates outstanding results across all tested settings.
I have several concerns: 1. The proposed unary and binary relationships align with the goals of contrastive learning, which promotes close representations for variants of the same sample and separation for different samples. The authors rely on overfitting to training data for verification, but as contrastive learning improves, this approach may be less effective. Enhanced contrastive learning might eventually generalize representations, clustering representations from single-sample into a si
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Data Quality and Management
Methods: Attention Is All You Need · Average Pooling · Max Pooling · Convolution · Softmax · Normalized Temperature-scaled Cross Entropy Loss · Random Resized Crop · Kaiming Initialization
