DiscDiff: Latent Diffusion Model for DNA Sequence Generation
Zehui Li, Yuhao Ni, William A V Beardall, Guoxuan Xia, Akashaditya Das, Guy-Bart Stan, Yiren Zhao

TL;DR
This paper presents DiscDiff, a latent diffusion model for generating realistic DNA sequences, enhanced by the Absorb-Escape algorithm, and introduces a new multi-species DNA dataset to advance genetic sequence modeling.
Contribution
The paper introduces DiscDiff, a novel latent diffusion framework for DNA generation, and Absorb-Escape, a post-training correction method, along with a comprehensive multi-species DNA dataset.
Findings
DiscDiff outperforms existing diffusion models in DNA sequence generation.
Absorb-Escape improves sequence realism by correcting conversion errors.
The new dataset includes 160,000 sequences from 15 species.
Abstract
This paper introduces a novel framework for DNA sequence generation, comprising two key components: DiscDiff, a Latent Diffusion Model (LDM) tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences. Absorb-Escape enhances the realism of the generated sequences by correcting 'round errors' inherent in the conversion process between latent and input spaces. Our approach not only sets new standards in DNA sequence generation but also demonstrates superior performance over existing diffusion models in generating both short and long DNA sequences. Additionally, we introduce EPD-GenDNA, the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species. We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and…
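To make the idea of a 'round error' concrete, here is a minimal toy sketch in Python. It is not the DiscDiff decoder and not the Absorb-Escape algorithm. It only assumes that a continuous reconstruction is turned back into nucleotides by taking the highest-scoring base at each position. The helper names (`one_hot`, `discretize`, `count_mismatches`) and the score values are hypothetical and chosen for illustration.

```python
# Illustrative only: a toy model of how discretizing continuous scores
# can flip a nucleotide. Not the paper's decoder or Absorb-Escape.
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot rows."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]

def discretize(scores):
    """Map each row of continuous scores to the highest-scoring nucleotide."""
    return "".join(ALPHABET[max(range(4), key=row.__getitem__)] for row in scores)

def count_mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

original = "ACGT"
scores = one_hot(original)
# Hypothetical imperfect reconstruction: position 2 ('G') drifts towards 'T'.
scores[2] = [0.05, 0.05, 0.40, 0.50]

decoded = discretize(scores)
print(decoded, count_mismatches(original, decoded))  # prints: ACTT 1
```

In this toy setting a small drift in the continuous scores is enough to change one base after discretization, which is the kind of conversion error the abstract says Absorb-Escape is designed to correct.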
Taxonomy
Topics: Algorithms and Data Compression
Methods: Diffusion · Latent Diffusion Model
