Synthetic Tabular Data: Methods, Attacks and Defenses
Graham Cormode, Samuel Maddock, Enayat Ullah, Shripad Gade

TL;DR
This survey reviews recent advances in synthetic tabular data generation, discussing methods, privacy attacks, defenses, and open challenges to improve data utility and privacy preservation.
Contribution
It provides a comprehensive overview of key methodologies, attacks, and defenses in synthetic tabular data generation, highlighting current limitations and future research directions.
Findings
Probabilistic graphical models and deep learning are the two main paradigms covered for tabular synthetic data generation.
Synthetic data can still pose privacy risks: attacks can seek to retrieve information about the original sensitive data.
Open problems include balancing data utility and privacy protection.
Abstract
Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this survey, we cover the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation, before giving a technical deep-dive into the methodologies. We also address the limitations of synthetic data, by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.
