# Data-Copying in Generative Models: A Formal Framework

**Authors:** Robi Bhattacharjee, Sanjoy Dasgupta, Kamalika Chaudhuri

arXiv: 2302.13181 · 2023-03-03

## TL;DR

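This paper shows that the data-copying framework of Meehan et al. (2020) can miss certain kinds of blatant memorization in generative models. It proposes an alternative definition that applies more locally, with a detection method that provably works with high probability given enough data, and lower bounds on the samples required.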

## Contribution

The paper introduces a locally sensitive definition of data-copying, a detection method that is proven to succeed with high probability when enough data is available, and lower bounds on the sample size needed for reliable detection.

## Key findings

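- The proposed detection method provably works with high probability when enough data is available.
- Lower bounds characterize the number of samples required for reliable detection.
- The earlier framework of Meehan et al. (2020) may fail to detect certain kinds of blatant memorization; the new, more local definition is designed to address this limitation of previous global detection approaches.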
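
The contrast between global and local detection can be illustrated with a small toy sketch. This is not the paper's algorithm or its formal definition: the fixed `radius`, the `ratio` threshold, the held-out `reference` sample, and the function names are all hypothetical choices made for illustration. The idea is to look at a small ball around each training point and flag it when the generated samples place far more mass there than fresh samples from the true distribution do.

```python
import math
import random


def ball_fraction(points, center, radius):
    """Fraction of `points` within Euclidean distance `radius` of `center`."""
    inside = sum(1 for p in points if math.dist(p, center) <= radius)
    return inside / len(points)


def flag_local_copying(train, generated, reference, radius, ratio):
    """Return indices of training points whose neighborhood holds far more
    generated mass than reference mass (illustrative heuristic only)."""
    flagged = []
    for i, x in enumerate(train):
        gen_frac = ball_fraction(generated, x, radius)
        ref_count = ball_fraction(reference, x, radius) * len(reference)
        # Add-one smoothing keeps the comparison meaningful when the
        # reference sample has no points inside the ball.
        smoothed_ref = (ref_count + 1) / (len(reference) + 1)
        if gen_frac > ratio * smoothed_ref:
            flagged.append(i)
    return flagged


rng = random.Random(0)


def draw():
    """One sample from the toy 'true' distribution (2D standard Gaussian)."""
    return (rng.gauss(0, 1), rng.gauss(0, 1))


train = [draw() for _ in range(50)]
reference = [draw() for _ in range(2000)]

# Hypothetical generator: mostly fresh samples, but about 20% of its
# outputs are near-exact copies of a single training point, train[0].
generated = []
for _ in range(2000):
    if rng.random() < 0.2:
        cx, cy = train[0]
        generated.append((cx + rng.gauss(0, 0.001), cy + rng.gauss(0, 0.001)))
    else:
        generated.append(draw())

flagged = flag_local_copying(train, generated, reference, radius=0.05, ratio=10.0)
print("training points flagged as locally copied:", flagged)
```

In this toy setup the copied point is caught because the check is made per neighborhood rather than through a single statistic averaged over the whole space. How to choose the radius and threshold with guarantees is exactly the kind of question the paper treats formally, and this sketch does not attempt to reproduce that analysis.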

## Abstract

There has been some recent interest in detecting and addressing memorization of training data by deep neural networks. A formal framework for memorization in generative models, called "data-copying," was proposed by Meehan et al. (2020). We build upon their work to show that their framework may fail to detect certain kinds of blatant memorization. Motivated by this and the theory of non-parametric methods, we provide an alternative definition of data-copying that applies more locally. We provide a method to detect data-copying, and provably show that it works with high probability when enough data is available. We also provide lower bounds that characterize the sample requirement for reliable detection.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.13181/full.md

## Figures

28 figures with captions in the complete paper: https://tomesphere.com/paper/2302.13181/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/2302.13181/full.md

---
Source: https://tomesphere.com/paper/2302.13181