Secure Cross-Silo Synthetic Genomic Data Generation
Daniil Filienko, Martine De Cock, Sikha Pentyala

TL;DR
This paper introduces a method combining secure multiparty computation and differential privacy to generate high-utility synthetic genomic data across multiple institutions without exposing sensitive raw data.
Contribution
The paper presents a novel federated approach to privacy-preserving synthetic genomic data generation that uses secure multiparty computation (MPC) and differential privacy (DP), so that institutions can collaborate without sharing their raw data.
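As a rough illustration of how MPC and DP can fit together, the sketch below simulates three sites that additively secret-share a local count, sum the shares, and release only a Laplace-noised total. This is not the paper's protocol: the function names, the modulus, the example counts, and the choice of the Laplace mechanism are all assumptions made for illustration. For simplicity the noise is added after reconstruction, whereas a real protocol would typically sample the noise inside the secure computation so that no party sees the exact aggregate.

```python
import random
import secrets

# Illustrative field modulus for additive secret sharing (an assumption).
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split a non-negative integer into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


def secure_sum(site_values):
    """Simulate a secure sum: every site shares its value with all sites,
    each site adds the shares it holds, and only the total is reconstructed."""
    n = len(site_values)
    # all_shares[i][j] is the share of site i's value held by party j.
    all_shares = [share(v, n) for v in site_values]
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    return reconstruct(partial_sums)


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_secure_count(site_counts, epsilon, rng):
    """Release a noisy total count; a count query has sensitivity 1."""
    total = secure_sum(site_counts)
    return total + laplace_noise(1.0 / epsilon, rng)


# Hypothetical per-site counts (e.g. patients with some property).
counts = [3, 5, 12]
exact_total = secure_sum(counts)  # 20, computed without pooling raw counts
noisy_total = dp_secure_count(counts, epsilon=1.0, rng=random.Random(0))
```

Each individual share is uniformly random on its own, so a single party learns nothing about another site's count from the shares it holds; the DP noise then limits what the released total reveals about any one individual.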
Findings
Successfully generated synthetic RNA-seq datasets from multiple cohorts
Preserved data utility while ensuring privacy in federated settings
Demonstrated effectiveness in real-world genomic data scenarios
Abstract
Access to genomic data is highly regulated due to its sensitive nature. While safeguards are essential, cumbersome data access processes pose a significant barrier to the development of AI methods for genomics. Synthetic data generation can mitigate this tension by enabling broader data sharing without exposing sensitive information. Synthetic genomic data are produced by training generative models on real data and subsequently sampling artificial data that preserves relevant statistics while limiting disclosures about the underlying individuals. In some settings, a single data holder may have sufficient data to train such generative models; however, in many applications data must be combined across multiple sites to achieve adequate scale. This need arises, e.g., in rare disease studies, where individual hospitals typically hold data for only a small number of patients. The solution we…
