Latent Diffusion Model for DNA Sequence Generation
Zehui Li, Yuhao Ni, Tim August B. Huygelen, Akashaditya Das, Guoxuan Xia, Guy-Bart Stan, Yiren Zhao

TL;DR
This paper introduces DiscDiff, a latent diffusion model for generating synthetic DNA sequences by embedding discrete data into a continuous space, achieving high-quality, diverse samples and providing a new evaluation metric.
Contribution
The paper presents a novel latent diffusion approach for DNA sequence generation and introduces FReD as a new metric for assessing sample quality.
Findings
DiscDiff generates DNA sequences closely matching real data in motif and chromatin profiles.
The model achieves high diversity and quality in synthetic DNA generation.
A new cross-species DNA dataset is provided for future research.
Abstract
The harnessing of machine learning, especially deep generative models, has opened up promising avenues in the field of synthetic DNA sequence generation. Whilst Generative Adversarial Networks (GANs) have gained traction for this application, they often face issues such as limited sample diversity and mode collapse. On the other hand, Diffusion Models are a promising new class of generative models that are not burdened with these problems, enabling them to reach the state-of-the-art in domains such as image generation. In light of this, we propose a novel latent diffusion model, DiscDiff, tailored for discrete DNA sequence generation. By simply embedding discrete DNA sequences into a continuous latent space using an autoencoder, we are able to leverage the powerful generative abilities of continuous diffusion models for the generation of discrete data. Additionally, we introduce…
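Illustrative sketch
The abstract's core idea (map discrete DNA into a continuous latent space, then run a continuous diffusion process there) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the authors' implementation: the random orthogonal matrix `Q` is a hypothetical stand-in for a trained autoencoder, and the linear DDPM-style noise schedule is assumed rather than taken from the paper. All function names here are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Turn a DNA string into an (L, 4) one-hot matrix."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    return np.eye(len(ALPHABET))[idx]

def from_one_hot(x):
    """Map an (L, 4) score matrix back to a DNA string via argmax."""
    return "".join(ALPHABET[i] for i in x.argmax(axis=-1))

rng = np.random.default_rng(0)

# ASSUMPTION: a random orthogonal 4x4 map stands in for a trained
# autoencoder. It is invertible, so decoding an un-noised latent
# recovers the input exactly; a real encoder would be learned.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

def encode(seq):
    """Embed a discrete sequence into a continuous latent of shape (L, 4)."""
    return one_hot(seq) @ Q

def decode(z):
    """Project a latent back to nucleotide scores and take the argmax."""
    return from_one_hot(z @ Q.T)

# ASSUMPTION: linear beta schedule with 1000 steps (DDPM-style).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def forward_diffuse(z0, t, rng):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) z_0, (1 - abar_t) I)."""
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

seq = "ACGTTGCA"
z0 = encode(seq)                    # continuous latent of the sequence
zt = forward_diffuse(z0, 999, rng)  # almost pure noise at the last step
recovered = decode(z0)              # round trip without any noise
```

In a full latent diffusion pipeline, a denoising network would be trained to reverse `forward_diffuse`, and new sequences would be obtained by denoising random latents and passing the result through the decoder.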
Taxonomy
Topics: Genomics and Phylogenetic Studies · Genetics, Bioinformatics, and Biomedical Research · Machine Learning in Bioinformatics
Methods: ALIGN · Diffusion
