Generating Synthetic Genotypes using Diffusion Models
Philip Kenneweg, Raghuram Dandinasivara, Xiao Luo, Barbara Hammer, Alexander Schönhuth

TL;DR
This paper presents a diffusion model that generates realistic synthetic human genotypes, enabling privacy-preserving data sharing and improving classifier performance in biomedical research.
Contribution
The first diffusion model for generating complete synthetic human genotypes that can be expanded into full genomes, maintaining data utility and privacy.
Findings
Synthetic genotypes mimic real data without reproducing known genotypes.
Classifiers trained on synthetic data achieve near real-data accuracy.
Augmenting real data with synthetic genotypes enhances model performance.
Abstract
In this paper, we introduce the first diffusion model designed to generate complete synthetic human genotypes, which, by standard protocols, one can straightforwardly expand into full-length, DNA-level genomes. In terms of approved metrics, the synthetic genotypes mimic real human genotypes without merely reproducing known genotypes. When training biomedically relevant classifiers with synthetic genotypes, accuracy is near-identical to the accuracy achieved when training classifiers with real data. We further demonstrate that augmenting small amounts of real data with synthetically generated genotypes drastically improves performance rates. This addresses a significant challenge in translational human genetics: real human genotypes, although emerging in large volumes from genome-wide association studies, are sensitive private data, which limits their public availability. Therefore, the…
Peer Reviews
Decision: Submitted to ICLR 2025
This article is very exciting. It is incredibly well-written with a clear and enjoyable prose. It tackles an important issue, human genome analysis, with state of the art methods, diffusion models. It presents an interesting comparative study for deep learning in a data regime not often considered, especially in the diffusion literature, but of great import and scientific interest. The insights from the architecture comparison study are useful for a machine learning readership beyond the biomedi
It would have been helpful to contextualize the results in comparison to other methods, like the work of Szatkownik et al. or HyenaDNA. It is unclear to me how feasible the application of these methods to the full human genome is, but this could have been discussed in the results. However, it should be possible to generate smaller sequences with the proposed diffusion method in order to compare to previous works. The experimental results, as they are, demonstrate the viability of diffusion model
1. The paper is well-written and presents a technically sound and well-structured approach. 2. Using diffusion models to generate full-length genotypes is a novel approach. The rationale and motivation for this approach are clearly stated. 3. The paper presents a significant contribution to addressing the challenges with data access restrictions and privacy in genome data. 4. The paper presents empirical evidence for comparable performance between the models trained with synthetic and real d
1. Authors should discuss potential limitations of their current evaluation approach in more detail. 2. Could authors discuss any limitations in generalizing their approach to other genomic datasets?
The paper is relatively clear although the language and grammar could use polishing. The claimed contribution is clear and the authors discuss prior work clearly. For the most part the paper is easy to follow. The problem of synthetic data generation is a significant one and the development of synthetic human genomes is surely of interest to some in the research community, although I do not work in this area.
Although the authors discuss prior work, they do not seem to compare to modeling approaches used in it. It is not clear if this is because of technical limitations or an oversight. The motivation for generating synthetic genotypes to improve ML models is clear, but the authors do not spend enough time discussing the possible negative ethical implications of generating human genotypes. I appreciated the evaluation of privacy loss but the implications were not fleshed out. Specific comments: -
Taxonomy
Topics: Gene expression and cancer classification
