30 Years of Synthetic Data
Joerg Drechsler, Anna-Carolina Haensch

TL;DR
This paper reviews 30 years of synthetic data research, covering historical developments, the diverse methodologies proposed, and strategies for assessing utility and disclosure risk, and it emphasizes the growing role of synthetic data in data privacy and accessibility.
Contribution
It provides a comprehensive overview of synthetic data evolution, methodologies, and evaluation strategies over three decades, connecting past insights with current practices.
Findings
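Synthetic data research has expanded significantly over 30 years, gaining particular momentum in the last ten.
Multiple approaches and metrics exist for assessing the utility and remaining disclosure risk of generated data.
Synthetic data can broaden access to sensitive microdata while protecting privacy.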
Abstract
The idea of generating synthetic data as a tool for broadening access to sensitive microdata was first proposed three decades ago. While the first applications of the idea emerged around the turn of the century, the approach has really gained momentum over the last ten years, stimulated at least in part by recent developments in computer science. We take the upcoming 30th anniversary of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments and to review the diverse approaches and methodological underpinnings proposed over the years. We also discuss the various strategies that have been suggested for measuring the utility and remaining disclosure risk of the generated data.
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Human Mobility and Location-Based Analysis
