SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion
Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio

TL;DR
SNPgen is a novel two-stage framework that generates phenotype-supervised synthetic genotypes using latent diffusion, enabling privacy-preserving data sharing while maintaining utility for genomic prediction tasks.
Contribution
SNPgen introduces a conditional latent diffusion approach combined with GWAS-guided variant selection for realistic, phenotype-aligned synthetic genotype generation.
Findings
Synthetic data matches real-data predictive performance.
Synthetic genotypes show high allele-frequency correlation with the source data.
No synthetic genotype is identical to a source genotype, which supports privacy protection.
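The two fidelity and privacy checks named above can be illustrated with a minimal sketch. It assumes genotypes are coded as 0/1/2 alternate-allele dosages in an individuals-by-SNPs layout, that "allele frequency correlation" is a Pearson correlation of per-SNP frequencies, and that an "identical match" is an exact row-for-row copy of a source genotype. These definitions, the function names, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
from math import sqrt


def allele_frequencies(genotypes):
    """Per-SNP alternate-allele frequency from 0/1/2 dosage rows."""
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    # Each diploid individual carries two alleles per SNP.
    return [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_snps)
    ]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def count_identical_matches(real, synthetic):
    """Number of synthetic rows that exactly copy some real row."""
    real_rows = {tuple(row) for row in real}
    return sum(tuple(row) in real_rows for row in synthetic)


# Toy data (hypothetical): 4 real and 3 synthetic individuals, 4 SNPs.
real = [[0, 1, 2, 0], [1, 1, 2, 0], [0, 2, 1, 1], [0, 1, 2, 0]]
synthetic = [[0, 1, 2, 1], [1, 1, 1, 0], [0, 2, 2, 0]]

af_corr = pearson(allele_frequencies(real), allele_frequencies(synthetic))
n_copies = count_identical_matches(real, synthetic)
print(f"allele-frequency correlation: {af_corr:.3f}")
print(f"identical matches: {n_copies}")
```

An exact-match count is a coarse privacy check: it flags verbatim copies but says nothing about near-duplicates, so a distance-based check would be a natural complement.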
Abstract
Polygenic risk scores and other genomic analyses require large individual-level genotype datasets, yet strict data access restrictions impede sharing. Synthetic genotype generation offers a privacy-preserving alternative, but most existing methods operate unconditionally, producing samples without phenotype alignment, or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility. We present SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes. SNPgen combines GWAS-guided variant selection (1,024-2,048 trait-associated SNPs) with a variational autoencoder for genotype compression and a latent diffusion model conditioned on binary disease labels via classifier-free guidance. Evaluated on 458,724 UK Biobank individuals across four complex diseases (coronary artery disease, breast…
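Classifier-free guidance, mentioned in the abstract, can be sketched in a few lines. The sketch below uses one common formulation, in which the guided noise estimate is the unconditional estimate plus a scaled difference between the conditional and unconditional estimates. The toy denoiser, the guidance scale, and the latent values are hypothetical placeholders; the paper's actual network, scale, and sampling schedule are not given in this summary.

```python
def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise estimates.

    guidance_scale = 0 gives the unconditional estimate,
    guidance_scale = 1 gives the conditional estimate,
    and larger values push samples further toward the label.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def toy_denoiser(latent, label):
    """Hypothetical stand-in for a trained noise predictor.

    label is 1 (case), 0 (control), or None (label dropped, i.e. unconditional).
    """
    if label is None:
        shift = 0.0
    else:
        shift = 0.5 if label == 1 else -0.5
    return [0.1 * z + shift for z in latent]


# One guided prediction for a toy 3-dimensional latent, conditioned on "case".
latent = [1.0, -2.0, 0.5]
eps_cond = toy_denoiser(latent, 1)
eps_uncond = toy_denoiser(latent, None)
eps_guided = guided_noise(eps_cond, eps_uncond, guidance_scale=2.0)
print(eps_guided)
```

In a full sampler this combination would be applied at every denoising step, and the unconditional branch would come from the same network trained with the label randomly dropped.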
Taxonomy
Topics: Genetic Associations and Epidemiology · Advanced Causal Inference Techniques · Bioinformatics and Genomic Networks
