Privacy-preserving verification of preprocessing in federated learning for genomic data
Wenbiao Li, Anisa Halimi, Jaideep Vaidya, Xiaoqian Jiang, Erman Ayday

TL;DR
This paper introduces a privacy-preserving method to verify that different institutions in a genomic study applied the same preprocessing steps without revealing raw data.
Contribution
A novel framework that uses local differential privacy (LDP) and LIME explanations to audit preprocessing pipelines in federated learning for genomics.
Findings
In centralized simulation, the verifier achieved 80% accuracy across 15 preprocessing configurations while keeping membership-inference attack power below 0.05 at ε = 3.
Binary compatibility detection reached 70% accuracy at 500 SNPs in distributed Flower FL experiments with data partitioned across three sites.
Differentially private explanation vectors serve as auditable fingerprints for preprocessing configurations.
Abstract
The aim is to verify that federated genomic study sites applied identical preprocessing pipelines without disclosing raw genotypes. Each institution perturbs a 100-SNP slice using local differential privacy (LDP), trains a RandomForest classifier, and transmits one LIME explanation vector to a coordinating server. The server simulates 15 preprocessing combinations and trains a RandomForest classifier to predict each site’s configuration. In centralized simulation, the verifier achieved 80% accuracy across 15 preprocessing configurations on the GMMAT (n = 400) and synthetic genome (n = 2504) datasets while maintaining membership-inference attack power below 0.05 at ε = 3. In distributed Flower FL experiments with data partitioned across three sites, binary compatibility detection reached 70% accuracy at 500 SNPs. A single differentially private explanation vector provides an auditable…
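The abstract does not say which LDP mechanism each site uses to perturb its SNP slice. As a minimal sketch, assuming genotypes are coded 0/1/2 and that the perturbation is k-ary randomized response applied independently to each genotype, the site-side step could look like the following. The function names (`keep_probability`, `k_rr`, `perturb_genotypes`) are hypothetical, and how the paper allocates the privacy budget ε across the 100 SNPs is not stated here, so ε below is a per-genotype parameter.

```python
import math
import random

GENOTYPES = (0, 1, 2)  # assumed coding: copies of the minor allele


def keep_probability(epsilon, k=len(GENOTYPES)):
    """Probability of reporting the true value under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def k_rr(value, epsilon, rng=random):
    """Report the true genotype with probability p, else a uniformly chosen other one."""
    if rng.random() < keep_probability(epsilon):
        return value
    return rng.choice([g for g in GENOTYPES if g != value])


def perturb_genotypes(rows, epsilon, rng=random):
    """Perturb every genotype in a samples-by-SNPs matrix (list of lists)."""
    return [[k_rr(g, epsilon, rng) for g in row] for row in rows]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 5-sample, 100-SNP slice; real data would come from the site's genotype file.
    slice_ = [[rng.choice(GENOTYPES) for _ in range(100)] for _ in range(5)]
    noisy = perturb_genotypes(slice_, epsilon=3.0, rng=rng)
    flipped = sum(a != b for r, n in zip(slice_, noisy) for a, b in zip(r, n))
    print(f"keep probability: {keep_probability(3.0):.3f}, flipped: {flipped}/500")
```

In the workflow the abstract describes, the perturbed matrix would then feed the site's RandomForest classifier, and only the resulting LIME explanation vector would leave the site. Those later steps are not sketched here.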
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Genetic Associations and Epidemiology · Ethics in Clinical Research
