Radioactive data: tracing through training

Alexandre Sablayrolles; Matthijs Douze; Cordelia Schmid; Hervé Jégou

arXiv:2002.00937 · stat.ML · February 6, 2020 · 33 cites

TL;DR

This paper introduces "radioactive data," a technique that makes imperceptible changes to a dataset so that any model trained on it carries a detectable mark. The mark lets the dataset owner verify, with a p-value, that the data was used, even when only 1% of the training data is radioactive.

Contribution

The paper presents a method for embedding a provenance mark in a dataset and detecting it in trained models. The mark is robust to changes in architecture, optimization method, and data augmentation.

Findings

01 High-confidence detection (p < 10^-4) with only 1% radioactive data

02 Robust detection across different architectures and training procedures

03 Effective even with data augmentation and stochastic training noise

Abstract

We want to detect whether a particular image dataset has been used to train a model. We propose a new technique, radioactive data, that makes imperceptible changes to this dataset such that any model trained on it will bear an identifiable mark. The mark is robust to strong variations such as different architectures or optimization methods. Given a trained model, our technique detects the use of radioactive data and provides a level of confidence (p-value). Our experiments on large-scale benchmarks (ImageNet), using standard architectures (ResNet-18, VGG-16, DenseNet-121) and training procedures, show that we can detect usage of radioactive data with high confidence (p < 10^-4) even when only 1% of the data used to train the model is radioactive. Our method is robust to data augmentation and the stochasticity of deep network optimization. As a result, it offers a much higher…
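
The abstract says detection comes with a p-value but does not spell out the test. The sketch below shows one way such a p-value could be computed, under assumptions that are not stated on this page: the mark is taken to be a secret direction u in a d-dimensional feature space, and the test statistic is the cosine similarity between u and a classifier weight vector. Under the null hypothesis (no radioactive data), the weight vector is treated as uniformly random on the unit sphere, and the p-value is the probability of seeing a cosine at least as large as the observed one. The function name `cosine_pvalue` and its parameters are hypothetical.

```python
import math


def cosine_pvalue(tau: float, d: int, steps: int = 20000) -> float:
    """P(cos(u, v) >= tau) for a fixed unit vector u and v uniform on the
    unit sphere in R^d (d >= 3). Illustrative assumption, not the paper's code.

    The cosine t has density (1 - t^2)^((d-3)/2) / B(1/2, (d-1)/2) on [-1, 1];
    the tail is integrated numerically with Simpson's rule.
    """
    if d < 3:
        raise ValueError("d must be at least 3")
    if not -1.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [-1, 1]")

    # Beta function B(1/2, (d-1)/2), computed via log-gamma for stability.
    beta = math.exp(
        math.lgamma(0.5) + math.lgamma((d - 1) / 2) - math.lgamma(d / 2)
    )
    exponent = (d - 3) / 2

    def density(t: float) -> float:
        # max(...) guards against tiny negative values from rounding at t = 1.
        return max(0.0, 1.0 - t * t) ** exponent / beta

    n = steps if steps % 2 == 0 else steps + 1  # Simpson needs an even count
    h = (1.0 - tau) / n
    total = density(tau) + density(1.0)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * density(tau + i * h)
    return total * h / 3


if __name__ == "__main__":
    # Hypothetical observation: cosine 0.2 in a 512-dimensional feature space.
    print(cosine_pvalue(0.2, 512))
```

As a sanity check, for d = 3 the cosine is uniform on [-1, 1], so `cosine_pvalue(0.5, 3)` is 0.25. In high dimension, random directions are nearly orthogonal, so even a modest observed cosine yields a very small p-value; that is the intuition behind high-confidence detection from a small marked fraction.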

Code & Models

Videos

Radioactive data: tracing through training (Paper Explained) · YouTube

Radioactive data: tracing through training · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · COVID-19 diagnosis using AI