Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention
Cedric Deslandes Whitney, Justin Norman

TL;DR
This paper discusses the significant risks associated with synthetic data in machine learning, including false confidence in dataset diversity and potential circumvention of user consent, which complicate ethical governance.
Contribution
The paper identifies and analyzes two key risks of synthetic data, overconfidence in dataset diversity and circumvention of consent, and highlights their implications for ethics and governance.
Findings
Synthetic data can lead to false confidence in model evaluation.
Using synthetic data may bypass user consent regulations.
Synthetic data can concentrate power away from impacted communities.
Abstract
Machine learning systems require representations of the real world for training and testing: they require data, and lots of it. Collecting data at scale has logistical and ethical challenges, and synthetic data promises a solution to these challenges. Instead of needing to collect photos of real people's faces to train a facial recognition system, a model creator could create and use photo-realistic, synthetic faces. The comparative ease of generating this synthetic data rather than relying on collecting data has made it a common practice. We present two key risks of using synthetic data in model development. First, we detail the high risk of false confidence when using synthetic data to increase dataset diversity and representation. We base this on an examination of a real-world use case of synthetic data, where synthetic datasets were generated for an evaluation of facial…
Taxonomy
Methods: Balanced Selection
