Extracting Training Data from Unconditional Diffusion Models
Yunhao Chen, Xingjun Ma, Difan Zou, Yu-Gang Jiang

TL;DR
This paper develops a theoretical framework and new methods for extracting training data from diffusion models, revealing their memorization properties and improving data recovery techniques.
Contribution
It introduces a theoretical analysis of memorization in diffusion models and proposes SIDE, a novel data extraction method that outperforms previous approaches.
Findings
SIDE extracts data from unconditional diffusion models where prior methods fail
Theoretical analysis provides new insights into memorization in diffusion models
SIDE is over 50% more effective than previous approaches on the CelebA dataset
Abstract
As diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI), the study of their memorization of the raw training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn by memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for more controllable generation and trustworthy application of Artificial Intelligence Generated Content (AIGC). While previous works have made important observations of when DPMs are prone to memorization, these findings are mostly empirical, and the developed data extraction methods only work for conditional diffusion models. In this work, we aim to establish a theoretical understanding of memorization…
Taxonomy
Topics: Advanced Data Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning
Methods: Diffusion
