In Defense of Synthetic Data

Luke Rodriguez; Bill Howe

arXiv:1905.01351 · cs.DB · May 7, 2019 · 1 citation

Open Access

TL;DR

This paper advocates for the use of carefully generated synthetic data as a responsible alternative to real data, emphasizing its benefits for privacy, bias correction, and equitable research in machine learning.

Contribution

It challenges the negative perception of synthetic data, highlighting its advantages and advocating for its role in responsible AI development.

Findings

01. Synthetic data enhances privacy and bias correction.

02. Curated synthetic datasets support early product development.

03. Properly generated synthetic data promotes responsible AI.

Abstract

Synthetic datasets have long been thought of as second-rate, to be used only when "real" data collected directly from the real world is unavailable. But this perspective assumes that raw data is clean, unbiased, and trustworthy, which it rarely is. Moreover, the benefits of synthetic data for privacy and for bias correction are becoming increasingly important in any domain that works with people. Curated synthetic datasets (synthetic data derived from minimal perturbations of real data) enable early-stage product development and collaboration, protect privacy, afford reproducibility, increase dataset diversity in research, and protect disadvantaged groups from problematic inferences on original data that reflects systematic discrimination. Rather than representing a departure from the true state of the world, in this paper we argue that properly generated synthetic data is a step…

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing