D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation
Zhao Yang, Hengchang Liu, Chuan Cao, Bing Su

TL;DR
D3LM is a bidirectional DNA language model trained with masked diffusion, so a single model supports both understanding and generation of DNA sequences. It improves on NT v2 of the same size on understanding tasks and reports empirical insights on design choices for future research.
Contribution
It unifies bidirectional DNA understanding and generation through masked diffusion, a new training paradigm for DNA language models, and provides systematic empirical analysis of design choices.
Findings
Improves performance on understanding tasks over NT v2 of the same size.
Achieves an SFID of 10.92 on regulatory element generation.
Provides empirical insights on tokenization and sampling strategies.
Abstract
Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA where regulatory relationships are inherently bidirectional. We present D3LM (Discrete DNA Diffusion Language Model), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element…
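To make the masked diffusion objective mentioned in the abstract more concrete, here is a minimal Python sketch of one common formulation: a mask ratio t is sampled per sequence, each token is independently replaced by a mask symbol with probability t, and cross-entropy is computed on masked positions only. The names `MASK`, `corrupt`, and `masked_diffusion_loss`, the toy k-mer tokens, and the 1/t loss weighting are illustrative assumptions, not details confirmed by the paper.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token name, not taken from the paper


def corrupt(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, prob_of_true, t):
    """Cross-entropy over masked positions only, scaled by 1/t.

    prob_of_true[i] is the probability a (hypothetical) model assigns to the
    original token at position i. The 1/t weighting follows common masked
    diffusion formulations and is an assumption here.
    """
    nll = sum(
        -math.log(prob_of_true[i])
        for i, tok in enumerate(noisy)
        if tok == MASK
    )
    return nll / (t * len(tokens))


rng = random.Random(0)
tokens = ["ACGTAC", "GGCTTA", "TTAGCA", "CCGATA"]  # toy k-mer tokens (illustrative)
t = rng.uniform(0.05, 1.0)        # mask ratio sampled per sequence
noisy = corrupt(tokens, t, rng)   # partially masked input for the model
stand_in = [0.25] * len(tokens)   # placeholder for real model probabilities
loss = masked_diffusion_loss(tokens, noisy, stand_in, t)
```

Because the mask ratio varies from near 0 to 1 during training, the same bidirectional network sees both lightly masked inputs (close to BERT-style prediction) and almost fully masked inputs (close to generation from scratch), which is the intuition behind using one model for both understanding and generation.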
Taxonomy
Topics: DNA and Biological Computing · Genomics and Chromatin Dynamics · DNA and Nucleic Acid Chemistry
