# Privacy-preserving verification of preprocessing in federated learning for genomic data

**Authors:** Wenbiao Li, Anisa Halimi, Jaideep Vaidya, Xiaoqian Jiang, Erman Ayday

PMC · DOI: 10.1093/jamiaopen/ooag040 · 2026-03-26

## TL;DR

This paper introduces a privacy-preserving method to verify that different institutions in a genomic study applied the same preprocessing steps without revealing raw data.

## Contribution

A novel framework that uses local differential privacy (LDP) and LIME explanations to audit preprocessing pipelines in federated learning for genomics.

## Key findings

- The verifier achieved 80% accuracy across 15 preprocessing configurations in centralized simulations while keeping membership-inference attack power below 0.05 at ε = 3.
- Binary compatibility detection reached 70% accuracy at 500 SNPs in distributed Flower FL experiments with data partitioned across three sites.
- Differentially private explanation vectors serve as auditable fingerprints for preprocessing configurations.

## Abstract

The objective is to verify that federated genomic study sites applied identical preprocessing pipelines without disclosing raw genotypes.

Each institution perturbs a 100-SNP slice using local differential privacy (LDP), trains a RandomForest classifier, and transmits one LIME explanation vector to a coordinating server. The server simulates 15 preprocessing combinations and trains a RandomForest classifier to predict each site’s configuration.
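The site-side perturbation step can be illustrated with a minimal sketch. The abstract does not say which LDP mechanism is used, so this example assumes k-ary randomized response over genotypes encoded as minor-allele counts 0, 1, 2. The function names (`keep_probability`, `perturb_genotype`, `perturb_slice`), the encoding, and the choice to apply ε to each SNP separately are all illustrative assumptions, not the authors' implementation.

```python
import math
import random

# Assumed encoding: each SNP is a minor-allele count (0, 1 or 2).
GENOTYPES = (0, 1, 2)


def keep_probability(epsilon, k=len(GENOTYPES)):
    """Probability of reporting the true value under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def perturb_genotype(value, epsilon, rng):
    """Report the true genotype with probability p, otherwise one of the
    other genotypes chosen uniformly at random."""
    if rng.random() < keep_probability(epsilon):
        return value
    return rng.choice([g for g in GENOTYPES if g != value])


def perturb_slice(genotypes, epsilon, seed=None):
    """Perturb every SNP in a slice independently (epsilon applies per SNP
    in this sketch; the paper's budget allocation is not stated)."""
    rng = random.Random(seed)
    return [perturb_genotype(g, epsilon, rng) for g in genotypes]


if __name__ == "__main__":
    rng = random.Random(0)
    snp_slice = [rng.choice(GENOTYPES) for _ in range(100)]  # toy 100-SNP slice
    noisy = perturb_slice(snp_slice, epsilon=3.0, seed=1)
    changed = sum(a != b for a, b in zip(snp_slice, noisy))
    print(f"keep probability at eps=3: {keep_probability(3.0):.3f}")
    print(f"{changed} of {len(snp_slice)} genotypes changed")
```

Under this assumed mechanism, the ratio between the probability of reporting the true genotype and the probability of reporting any specific other genotype is exactly e^ε, which is what gives each perturbed SNP its ε-LDP guarantee. The perturbed slice would then feed the local classifier and explanation step described above.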

In centralized simulation, the verifier achieved 80% accuracy across 15 preprocessing configurations on the GMMAT (n = 400) and synthetic genome (n = 2504) datasets while maintaining membership-inference attack power below 0.05 at ε = 3. In distributed Flower FL experiments with data partitioned across three sites, binary compatibility detection reached 70% accuracy at 500 SNPs.
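For readers unfamiliar with the privacy parameter, the standard definition of ε-local differential privacy can be written as follows. The symbols here are introduced for illustration and are not taken from the paper: $\mathcal{M}$ is the randomizing mechanism each site applies, $x$ and $x'$ are any two possible inputs, and $y$ is any possible output.

$$
\Pr[\mathcal{M}(x) = y] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(x') = y] \quad \text{for all inputs } x, x' \text{ and all outputs } y
$$

At ε = 3 the bound is $e^{3} \approx 20.1$, so no reported value can be more than about 20 times likelier under one input than under another.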

A single differentially private explanation vector provides an auditable preprocessing fingerprint. The gap between centralized and distributed accuracy reflects expected FL data partitioning effects.

This framework demonstrates the feasibility of automated preprocessing verification in federated genomic consortia without compromising participant privacy.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13019136/full.md

---
Source: https://tomesphere.com/paper/PMC13019136