In Defense of Synthetic Data
Luke Rodriguez, Bill Howe

TL;DR
This paper advocates for the use of carefully generated synthetic data as a responsible alternative to real data, emphasizing its benefits for privacy, bias correction, and equitable research in machine learning.
Contribution
It challenges the negative perception of synthetic data, highlighting its advantages and advocating for its role in responsible AI development.
Findings
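Synthetic data can protect privacy and help correct bias, both of which are increasingly important in any domain that works with people.
Curated synthetic datasets, derived from minimal perturbations of real data, support early-stage product development and collaboration.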
Properly generated synthetic data promotes responsible AI.
Abstract
Synthetic datasets have long been thought of as second-rate, to be used only when "real" data collected directly from the real world is unavailable. But this perspective assumes that raw data is clean, unbiased, and trustworthy, which it rarely is. Moreover, the benefits of synthetic data for privacy and for bias correction are becoming increasingly important in any domain that works with people. Curated synthetic datasets - synthetic data derived from minimal perturbations of real data - enable early stage product development and collaboration, protect privacy, afford reproducibility, increase dataset diversity in research, and protect disadvantaged groups from problematic inferences on the original data that reflects systematic discrimination. Rather than representing a departure from the true state of the world, in this paper we argue that properly generated synthetic data is a step…
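To make the phrase "minimal perturbations of real data" concrete, here is a minimal illustrative sketch in Python. It is not the authors' method: the function `perturb_records`, the `scale` parameter, the Gaussian noise model, and the sample records are all assumptions introduced here for illustration. Adding noise in this way does not by itself provide a formal privacy guarantee.

```python
import random


def perturb_records(records, numeric_fields, scale=0.05, seed=0):
    """Return a synthetic copy of `records` with small Gaussian noise
    added to each numeric field (hypothetical helper, for illustration).

    The noise standard deviation for a field is `scale` times the
    standard deviation of that field in the real data, so the
    perturbation stays small relative to the data's natural spread.
    """
    rng = random.Random(seed)

    # Population standard deviation of each numeric field in the real data.
    stds = {}
    for field in numeric_fields:
        values = [r[field] for r in records]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        stds[field] = variance ** 0.5

    synthetic = []
    for r in records:
        s = dict(r)  # copy, so the real records are never modified
        for field in numeric_fields:
            s[field] = r[field] + rng.gauss(0.0, scale * stds[field])
        synthetic.append(s)
    return synthetic


# Made-up example records; these are not data from the paper.
real = [
    {"age": 34, "income": 52000.0, "region": "north"},
    {"age": 51, "income": 61000.0, "region": "south"},
    {"age": 27, "income": 38000.0, "region": "north"},
]

synthetic = perturb_records(real, ["age", "income"])
```

Only the numeric fields are perturbed in this sketch. Categorical fields such as `region` pass through unchanged, which a real curation pipeline would likely need to handle separately.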
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing
