PROVGEN: A Privacy-Preserving Approach for Outcome Validation in Genomic Research
Yuzhou Jiang, Tianxi Ji, Erman Ayday

TL;DR
PROVGEN is a novel privacy-preserving method for sharing genomic datasets that enhances reproducibility and validation in GWAS by balancing data utility and privacy protection.
Contribution
It introduces a two-stage differential privacy approach with data utility adjustment for genomic data sharing, improving over existing methods.
Findings
Outperforms existing methods in GWAS error detection
Provides higher privacy protection against membership inference attacks
Maintains better data utility for genomic research
Abstract
As genomic research has grown increasingly popular in recent years, dataset sharing has remained limited due to privacy concerns. This limitation hinders the reproducibility and validation of research outcomes, both of which are essential for identifying computational errors during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into binary space and applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism that incorporates biological characteristics. Second, we restore data utility by adjusting the Minor Allele Frequency (MAF) values in the noisy dataset to align with published MAFs using optimal transport. Finally, we…
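The binary encoding and XOR-based perturbation described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's mechanism: the two-bit encoding (0 -> 00, 1 -> 01, 2 -> 11), the independent per-bit flip probability 1/(1 + e^epsilon) (plain randomized response, with none of the biological characteristics PROVGEN incorporates), and the function names. The optimal-transport MAF adjustment is not sketched; only the allele-frequency computation it would target is shown.

```python
import numpy as np

def encode_binary(genotypes):
    """Encode SNP values in {0, 1, 2} as two bits each (assumed encoding:
    0 -> 00, 1 -> 01, 2 -> 11)."""
    g = np.asarray(genotypes)
    high = (g == 2).astype(np.uint8)
    low = (g >= 1).astype(np.uint8)
    return np.stack([high, low], axis=-1)

def decode_binary(bits):
    """Decode by summing the two bits: 00 -> 0, 01 -> 1, 11 -> 2.
    The pattern 10, which can appear after noise, also decodes to 1."""
    return bits.sum(axis=-1)

def xor_perturb(bits, epsilon, rng):
    """XOR every bit with independent Bernoulli noise. Flipping each bit with
    probability 1 / (1 + e^epsilon) is standard randomized response; it is a
    stand-in for, not a reproduction of, the paper's XOR mechanism."""
    flip_prob = 1.0 / (1.0 + np.exp(epsilon))
    noise = (rng.random(bits.shape) < flip_prob).astype(np.uint8)
    return np.bitwise_xor(bits, noise)

def allele_frequency(genotypes):
    """Per-SNP frequency of the counted allele (rows = individuals,
    columns = SNPs). This equals the MAF when values count minor alleles."""
    return np.asarray(genotypes).mean(axis=0) / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical toy dataset: 200 individuals, 5 SNPs.
    data = rng.integers(0, 3, size=(200, 5))
    noisy = decode_binary(xor_perturb(encode_binary(data), epsilon=1.0, rng=rng))
    print("original frequencies:", allele_frequency(data))
    print("noisy frequencies:   ", allele_frequency(noisy))
```

Comparing the two printed frequency vectors shows the kind of drift that the second stage is meant to correct: the noise shifts per-SNP frequencies away from their original values, which is why the abstract describes realigning them with published MAFs.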
Taxonomy
Topics: Privacy-Preserving Technologies in Data
