A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective
Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiaoge Zhang, Kaiyu Tang, Xiao Li, Jing Li

TL;DR
This paper investigates how individual samples contribute to memorization in tabular diffusion models, shows that a small subset of samples causes most of the privacy leakage, and proposes DynamicCut to mitigate it.
Contribution
It provides the first data-centric analysis of memorization dynamics in tabular diffusion models and introduces DynamicCut, a novel mitigation method that reduces memorization while preserving data utility.
Findings
Memorization counts follow a heavy-tailed distribution: a small subset of samples causes most of the leakage.
Memorized samples are identified early and show stronger signals in initial training phases.
DynamicCut effectively reduces memorization with minimal impact on data quality across models.
Abstract
Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks by reproducing exact training samples. While prior work focuses on dataset-level augmentation to reduce memorization, little is known about which individual samples contribute most. We present the first data-centric study of memorization dynamics in tabular diffusion models. We quantify memorization for each real sample based on how many generated samples are flagged as replicas, using a relative distance ratio. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples contributes disproportionately to leakage, confirmed via sample-removal experiments. To understand this, we divide real samples into top- and non-top-memorized groups and analyze their training-time behaviors. We track when each sample is first memorized…
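The per-sample memorization count described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the abstract only states that replicas are flagged with a relative distance ratio, so the Euclidean metric, the restriction to numeric features, the nearest-to-second-nearest ratio, and the 1/3 threshold are all assumptions here, and the function names `memorization_counts` and `top_share` are hypothetical.

```python
import numpy as np


def memorization_counts(real, generated, ratio_threshold=1 / 3):
    """Count, for each real row, how many generated rows are flagged as its replica.

    Assumption: a generated row is a replica of its nearest real row when the
    nearest distance is less than `ratio_threshold` times the distance to the
    second-nearest real row. Needs at least two real rows.
    """
    real = np.asarray(real, dtype=float)
    generated = np.asarray(generated, dtype=float)

    # Pairwise Euclidean distances, shape (n_generated, n_real).
    diffs = generated[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the nearest and second-nearest real row for each generated row.
    nearest_two = np.argsort(dists, axis=1)[:, :2]
    rows = np.arange(len(generated))
    d1 = dists[rows, nearest_two[:, 0]]
    d2 = dists[rows, nearest_two[:, 1]]

    # Written as a product to avoid dividing by zero when d2 == 0.
    flagged = d1 < ratio_threshold * d2

    # Attribute each flagged generated row to its nearest real row.
    return np.bincount(nearest_two[flagged, 0], minlength=len(real))


def top_share(counts, fraction=0.1):
    """Share of all flagged replicas attributed to the top `fraction` of real rows."""
    counts = np.asarray(counts)
    total = counts.sum()
    if total == 0:
        return 0.0
    k = max(1, int(np.ceil(fraction * len(counts))))
    return float(np.sort(counts)[::-1][:k].sum() / total)


if __name__ == "__main__":
    # Toy data: three real rows, four generated rows.
    real = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    generated = np.array([[0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [10.0, 0.2]])
    counts = memorization_counts(real, generated)
    print(counts, top_share(counts, fraction=0.3))
```

A heavy-tailed pattern would show up as `top_share` staying close to 1 even for small values of `fraction`. The dense pairwise distance matrix is only practical for small tables; a nearest-neighbor index would be the natural replacement at scale.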
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Generative Adversarial Networks and Image Synthesis · Stochastic Gradient Optimization Techniques
