Provable Separations between Memorization and Generalization in Diffusion Models
Zeqi Ye, Qijie Zhu, Molei Tao, Minshuo Chen

TL;DR
This paper provides a theoretical analysis of why diffusion models tend to memorize training data instead of generalizing, revealing fundamental separations and proposing a pruning method to reduce memorization without harming output quality.
Contribution
It introduces a dual-separation framework based on estimation and approximation perspectives, offering new theoretical insights into diffusion model memorization and generalization.
Findings
Ground-truth score function does not minimize empirical denoising loss
Implementing empirical score function requires large network size
Pruning reduces memorization while preserving generation quality
Abstract
Diffusion models have achieved remarkable success across diverse domains, but they remain vulnerable to memorization -- reproducing training data rather than generating novel outputs. This not only limits their creative potential but also raises concerns about privacy and safety. While empirical studies have explored mitigation strategies, theoretical understanding of memorization remains limited. We address this gap through developing a dual-separation result via two complementary perspectives: statistical estimation and network approximation. From the estimation side, we show that the ground-truth score function does not minimize the empirical denoising loss, creating a separation that drives memorization. From the approximation side, we prove that implementing the empirical score function requires network size to scale with sample size, spelling a separation compared to the more…
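The estimation-side claim can be checked numerically on a toy problem. The sketch below is not the paper's setup: it assumes 1-D data drawn from N(0, 1), a variance-preserving forward process x_t = alpha * x_0 + sigma * eps with alpha^2 + sigma^2 = 1 (so the ground-truth score is simply -x), and the standard denoising score-matching loss. The function names, sample sizes, and time value are illustrative choices.

```python
import numpy as np

def empirical_score(x, data, alpha, sigma):
    """Score of the smoothed empirical distribution (1/n) * sum_i N(alpha * x_i, sigma^2)."""
    diff = alpha * data[None, :] - x[:, None]          # shape (m, n): center minus query point
    logits = -diff**2 / (2.0 * sigma**2)
    logits -= logits.max(axis=1, keepdims=True)        # stabilise the softmax weights
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return (w * diff).sum(axis=1) / sigma**2

def true_score(x):
    """Ground-truth score under the assumption p_t = N(0, 1)."""
    return -x

def denoising_losses(data, alpha, sigma, n_noise, rng):
    """Monte Carlo estimate of the empirical denoising loss for both scores,
    using the same noise draws so the comparison is paired."""
    idx = rng.integers(0, len(data), size=n_noise)     # resample training points
    eps = rng.standard_normal(n_noise)
    x_t = alpha * data[idx] + sigma * eps
    target = -eps / sigma                              # denoising regression target
    loss_t = np.mean((true_score(x_t) - target) ** 2)
    loss_e = np.mean((empirical_score(x_t, data, alpha, sigma) - target) ** 2)
    return loss_t, loss_e

rng = np.random.default_rng(0)
data = rng.standard_normal(10)                         # assumed tiny training set
t = 0.002                                              # assumed small diffusion time
alpha, sigma = np.exp(-t), np.sqrt(1.0 - np.exp(-2.0 * t))

loss_true, loss_emp = denoising_losses(data, alpha, sigma, n_noise=50_000, rng=rng)
print(f"ground-truth score loss: {loss_true:.3f}")
print(f"empirical score loss:    {loss_emp:.3f}")
```

In this toy setting the empirical score attains the smaller empirical denoising loss, which is the direction of the separation the abstract describes: a network trained to minimize that loss is pulled toward the memorizing solution rather than the ground-truth score.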
Peer Reviews
Decision: ICLR 2026 Poster
- The paper presents a strong theoretical understanding of why memorization happens in diffusion models.
- The paper shows that the loss gap persists even with more training data, so adding training data alone is not sufficient to prevent memorization.
- In the experimental section, it is not mentioned which model architecture was used.
- The experiments are only conducted on small datasets such as the CIFAR-10 dataset and a synthetic Gaussian mixture dataset.
- The pruning-based method is not evaluated against other SOTA pruning-based methods.
Misc:
- In line 95 there is a typo in "correspnding".
- Line 103: "an" -> "a".
The paper provides a solid theoretical analysis necessary for understanding the phenomenon of memorization and generalization in diffusion models. It rigorously establishes, in a statistical sense, the loss separation between the true and empirical score functions on empirical data, and quantitatively characterizes the network complexity required to learn each by leveraging results from universal approximation theory, which are appreciable theoretical contributions. Moreover, the authors effecti
- **Unclear theoretical hypothesis and motivation.** It is not entirely clear what specific hypothesis the theory aims to formalize or explain. Judging from the experiments, the paper seems to intend to relate sample size, network size, and weight decay to memorization and generalization in diffusion models, but the central claim or insight remains vague. The analysis does show that limited sample size can cause the empirical loss to favor memorization over generalization, but this is well-known
1. The paper provides a rigorous theoretical analysis of diffusion models by quantifying the loss gap between the empirical and ground-truth scores under sub-Gaussian, Hölder-smooth data distributions, and further establishes an architectural separation showing that approximating the empirical score requires higher model complexity.
2. The paper proposes a practical one-shot pruning method for Diffusion Transformers that removes low-importance heads in the small-t regime, effectively reducing me
1. The analysis mainly focuses on the small-t regime, but the paper does not empirically verify whether memorization indeed concentrates in this phase during real generation. An experiment comparing large-t and small-t generations could strengthen the claims.
2. I think the theoretical results rely on strong assumptions such as sub-Gaussianity, which limit their applicability to real-world data.
3. The proposed pruning method is tested on a single dataset, its generalization to other datasets o
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques
