End to End Collaborative Synthetic Data Generation
Sikha Pentyala, Geetha Sitaraman, Trae Claar, Martine De Cock

TL;DR
This paper introduces an end-to-end collaborative framework for generating and publishing privacy-preserving synthetic data. It lets multiple data custodians jointly produce useful synthetic datasets, and it is demonstrated on genomic data for leukemia.
Contribution
The paper presents a novel end-to-end framework for collaborative synthetic data generation that covers preprocessing, hyperparameter tuning, and evaluation, using secure multiparty computation protocols.
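To make the idea of secure multiparty computation concrete, here is a minimal sketch of additive secret sharing, one common building block for such protocols. It is not the paper's protocol: the modulus, the function names (`share`, `reconstruct`, `secure_sum`), and the per-site patient counts are all hypothetical, and the sketch assumes honest-but-curious parties simulated in a single process.

```python
import secrets

# Illustrative field modulus (an assumption, not taken from the paper).
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split a non-negative integer into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


def secure_sum(private_values):
    """Sum the custodians' private values without revealing any single one."""
    n = len(private_values)
    # Each custodian splits its value and sends share j to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds the shares it received; each share alone looks random.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the combined partial sums reveal the total.
    return reconstruct(partial_sums)


# Hypothetical per-site patient counts held by three custodians.
patient_counts = [12, 7, 23]
total = secure_sum(patient_counts)
```

Here each custodian learns only the aggregate, which is the kind of guarantee a collaborative preprocessing or evaluation step would need; a real deployment would run the parties on separate machines over authenticated channels.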
Findings
The framework enables privacy-preserving collaborative data sharing.
It is successfully applied to synthetic genomic data for leukemia.
The results demonstrate the feasibility of an end-to-end synthetic data pipeline.
Abstract
The success of AI is based on the availability of data to train models. While in some cases a single data custodian may have sufficient data to enable AI, often multiple custodians need to collaborate to reach a cumulative size required for meaningful AI research. The latter is, for example, often the case for rare diseases, with each clinical site having data for only a small number of patients. Recent algorithms for federated synthetic data generation are an important step towards collaborative, privacy-preserving data sharing. Existing techniques, however, focus exclusively on synthesizer training, assuming that the training data is already preprocessed and that the desired synthetic data can be delivered in one shot, without any hyperparameter tuning. In this paper, we propose an end-to-end collaborative framework for publishing of synthetic data that accounts for privacy-preserving…
Taxonomy
Topics: Semantic Web and Ontologies · Scientific Computing and Data Management · Advanced Database Systems and Queries
