Understanding and Mitigating Memorization in Diffusion Models for Tabular Data
Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiao Li, Jing Li

TL;DR
This paper investigates memorization in diffusion models for tabular data, shows that it occurs, and proposes two augmentation techniques, TabCutMix and TabCutMixPlus, that mitigate memorization while maintaining high-quality data generation.
Contribution
It provides the first comprehensive analysis of memorization in tabular diffusion models and introduces effective augmentation methods to address this issue.
Findings
Memorization increases with training epochs in tabular diffusion models.
TabCutMix and TabCutMixPlus effectively reduce memorization.
Proposed methods maintain high-quality data generation.
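The summary names TabCutMix and TabCutMixPlus but does not describe how they work. As a rough illustration only, the sketch below shows a generic CutMix-style augmentation for tabular rows, assuming that a new row is built by swapping a random subset of feature values between two training rows that share a class label. The function names, the swap_prob parameter, and the same-class restriction are assumptions made for this example, not the authors' confirmed procedure.

```python
import random


def cutmix_rows(row_a, row_b, swap_prob=0.5, rng=None):
    """Build a row that takes each feature from row_b with probability
    swap_prob and from row_a otherwise (hypothetical helper)."""
    rng = rng or random.Random()
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of features")
    return [b if rng.random() < swap_prob else a for a, b in zip(row_a, row_b)]


def augment_same_class(rows, labels, n_new, swap_prob=0.5, seed=0):
    """Create n_new mixed rows, pairing only rows with the same label
    (an assumption of this sketch)."""
    rng = random.Random(seed)

    # Group training rows by class label.
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)

    # Only classes with at least two rows can supply a pair to mix.
    eligible = [lab for lab, group in by_label.items() if len(group) >= 2]
    if not eligible:
        raise ValueError("need at least one class with two or more rows")

    new_rows, new_labels = [], []
    for _ in range(n_new):
        label = rng.choice(eligible)
        row_a, row_b = rng.sample(by_label[label], 2)
        new_rows.append(cutmix_rows(row_a, row_b, swap_prob, rng))
        new_labels.append(label)
    return new_rows, new_labels


if __name__ == "__main__":
    rows = [[1, "a", 10.0], [2, "b", 20.0], [3, "c", 30.0], [4, "d", 40.0]]
    labels = [0, 0, 1, 1]
    mixed, mixed_labels = augment_same_class(rows, labels, n_new=3)
    for row, label in zip(mixed, mixed_labels):
        print(label, row)
```

Each mixed row is a recombination of observed same-class values, so every individual feature value stays realistic while the full row is less likely to coincide with a single training record. Swapping features independently can break dependencies between columns, which a practical method would need to handle.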
Abstract
Tabular data generation has attracted significant research interest in recent years, with tabular diffusion models greatly improving the quality of synthetic data. However, while memorization, where models inadvertently replicate exact or near-identical training data, has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with the number of training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To…
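The abstract describes memorization as generating exact or near-identical copies of training rows, but this summary does not state how it is measured. One plausible way to quantify it is a nearest-neighbor distance-ratio test: a generated row is flagged when it lies much closer to its nearest training row than to its second-nearest. The sketch below assumes numerically encoded features, Euclidean distance, and a threshold of 1/3; all three are assumptions of this example rather than details taken from the page.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_memorized(sample, train_rows, ratio_threshold=1 / 3):
    """Flag a generated row whose nearest training row is much closer
    than its second-nearest (threshold value is an assumption)."""
    dists = sorted(euclidean(sample, row) for row in train_rows)
    if len(dists) < 2:
        raise ValueError("need at least two training rows")
    nearest, second = dists[0], dists[1]
    if second == 0.0:
        # The sample coincides with two or more training rows exactly.
        return True
    return nearest / second < ratio_threshold


def memorization_ratio(generated, train_rows, ratio_threshold=1 / 3):
    """Fraction of generated rows flagged as memorized."""
    flags = [is_memorized(g, train_rows, ratio_threshold) for g in generated]
    return sum(flags) / len(flags)


if __name__ == "__main__":
    train = [[0, 0], [10, 0], [0, 10]]
    generated = [[0.1, 0], [5, 5], [10, 0.2], [4, 1]]
    print(memorization_ratio(generated, train))
```

In this toy data, the first and third generated rows sit almost on top of a training row and are flagged, while the other two are not. For mixed-type tables, categorical columns would need a suitable encoding or a different distance before a test like this is meaningful.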
Taxonomy
Topics: Music and Audio Processing · Advanced Database Systems and Queries · Neural Networks and Applications
Methods: Diffusion
