To democratize research with sensitive data, we should make synthetic data more accessible

Erik-Jan van Kesteren

arXiv:2404.17271·stat.OT·November 7, 2024

Open Access

TL;DR

This paper advocates for making synthetic data more accessible through tools, education, and case studies to promote open and reproducible research with sensitive data, rather than solely improving synthesis methods.

Contribution

It emphasizes shifting focus from developing synthesis techniques to enhancing accessibility, education, and practical case studies for wider adoption of synthetic data.

Findings

01. Synthetic data has potential but limited adoption.

02. Accessibility and education are key to adoption.

03. Small-scale case studies can demonstrate utility.

Abstract

For over 30 years, synthetic data has been heralded as a promising solution to make sensitive datasets accessible. However, despite much research effort and several high-profile use cases, the widespread adoption of synthetic data as a tool for open, accessible, reproducible research with sensitive data is still a distant dream. In this opinion, Erik-Jan van Kesteren, head of the ODISSEI Social Data Science team, argues that in order to progress towards widespread adoption of synthetic data as a privacy-enhancing technology, the data science research community should shift focus away from developing better synthesis methods: instead, it should develop accessible tools, educate peers, and publish small-scale case studies.

Taxonomy

Topics: Data Quality and Management