Radioactive data: tracing through training
Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

TL;DR
This paper introduces 'radioactive data,' a technique that makes imperceptible changes to a dataset so that any model trained on it carries a detectable mark, enabling verification of dataset usage even when only 1% of the training data is radioactive.
Contribution
The paper presents a method for embedding a provenance mark in a dataset and detecting it in trained models; the mark is robust to changes in architecture, optimization method, and data augmentation.
Findings
High-confidence detection (p<10^-4) with only 1% radioactive data
Robust detection across different architectures and training procedures
Effective even with data augmentation and stochastic training noise
Abstract
We want to detect whether a particular image dataset has been used to train a model. We propose a new technique, radioactive data, that makes imperceptible changes to this dataset such that any model trained on it will bear an identifiable mark. The mark is robust to strong variations such as different architectures or optimization methods. Given a trained model, our technique detects the use of radioactive data and provides a level of confidence (p-value). Our experiments on large-scale benchmarks (ImageNet), using standard architectures (ResNet-18, VGG-16, DenseNet-121) and training procedures, show that we can detect usage of radioactive data with high confidence (p < 10^-4) even when only 1% of the data used to train our model is radioactive. Our method is robust to data augmentation and the stochasticity of deep network optimization. As a result, it offers a much higher…
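The abstract says detection yields a p-value but the summary does not say how it is computed. The following is a minimal sketch of one way such a p-value can arise, under assumptions that are not stated above: the detector measures the cosine similarity between a fixed mark direction and a weight vector of the trained model, and under the null hypothesis (no radioactive data used) that weight vector points in a uniformly random direction. The function name `cosine_p_value` and its parameters are hypothetical, and the integral is evaluated numerically with the standard library only.

```python
import math


def cosine_p_value(tau: float, d: int, steps: int = 20000) -> float:
    """Return P(cos(u, w) >= tau) for w uniform on the unit sphere in R^d.

    Assumption for illustration: under the null hypothesis the direction of
    w is uniformly random, so the cosine c has density
    f(c) = (1 - c^2)^((d - 3) / 2) / B(1/2, (d - 1) / 2) on [-1, 1].
    """
    if d < 3:
        raise ValueError("this sketch assumes d >= 3")
    if not -1.0 <= tau <= 1.0:
        raise ValueError("tau must be a cosine in [-1, 1]")

    power = (d - 3) / 2
    # Log of 1 / B(1/2, (d - 1) / 2), computed with lgamma to avoid overflow.
    log_norm = math.lgamma(d / 2) - math.lgamma(0.5) - math.lgamma((d - 1) / 2)

    def density(c: float) -> float:
        base = max(0.0, 1.0 - c * c)  # clamp tiny negative rounding errors
        if base == 0.0:
            return math.exp(log_norm) if power == 0 else 0.0
        return math.exp(log_norm + power * math.log(base))

    # Composite Simpson's rule on [tau, 1]; the interval count must be even.
    n = steps + (steps % 2)
    h = (1.0 - tau) / n
    total = density(tau) + density(1.0)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * density(tau + i * h)
    return total * h / 3
```

Under these assumptions the tail probability shrinks quickly as the dimension grows: in three dimensions the cosine is uniform on [-1, 1], so `cosine_p_value(0.5, 3)` is 0.25, whereas in a 512-dimensional space a cosine of 0.2 already falls below the 10^-4 level quoted in the abstract.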
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · COVID-19 diagnosis using AI
