Synthetic Data: Methods, Use Cases, and Risks
Emiliano De Cristofaro

TL;DR
Synthetic data offers a promising way to share useful datasets while protecting privacy, but it faces challenges and limitations that need careful consideration.
Contribution
This paper provides an introductory overview of synthetic data, discussing its applications, privacy concerns, and inherent limitations as a privacy-preserving technology.
Findings
Synthetic data can enable data sharing without exposing sensitive information.
There are significant privacy challenges and unaddressed risks associated with synthetic data.
Synthetic data has limitations that restrict its effectiveness as a privacy-enhancing solution.
Abstract
Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of a sensitive nature, and thus, sharing them can endanger the privacy of users and organizations. A possible alternative gaining momentum in both the research community and industry is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data or, more precisely, have similar statistical properties. In this article, we provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that are still unaddressed, and its inherent limitations as an effective privacy-enhancing technology.
Taxonomy
Topics: Semantic Web and Ontologies
