Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision
Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A. Lanman, Vaneet Aggarwal

TL;DR
This paper introduces ENBED, a byte-level encoder-decoder Transformer model for DNA sequences, enabling precise analysis and generation of genomic data with improved accuracy over existing models.
Contribution
The paper develops ENBED, a novel byte-level Transformer model with efficient attention, capable of sequence-to-sequence genomic analysis and mutation generation, surpassing prior models.
Findings
ENBED outperforms state-of-the-art models in genomic tasks.
Byte-level analysis preserves detailed sequence information.
The model effectively generates and validates viral mutations.
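A minimal sketch of why byte-level tokens keep single-base detail, compared with a multi-base (k-mer) scheme. This illustrates the general idea only and is not ENBED's tokenizer: the ASCII byte encoding, the non-overlapping 3-mer split, and the helper names `byte_tokens` and `kmer_tokens` are all assumptions made for the example.

```python
from __future__ import annotations


def byte_tokens(seq: str) -> list[int]:
    """One token per nucleotide: the ASCII byte value of each base (assumed encoding)."""
    return list(seq.encode("ascii"))


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mers; trailing bases that do not fill a k-mer are dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


ref = "ACGTACGTACGT"
# Hypothetical single-base insertion of "G" after the third base.
mut = ref[:3] + "G" + ref[3:]

ref_bytes, mut_bytes = byte_tokens(ref), byte_tokens(mut)
ref_kmers, mut_kmers = kmer_tokens(ref), kmer_tokens(mut)

# Byte level: deleting the one inserted token recovers the reference exactly.
recovered = mut_bytes[:3] + mut_bytes[4:]

# k-mer level: every k-mer after the insertion point is shifted and no longer matches.
changed_kmers = sum(a != b for a, b in zip(ref_kmers, mut_kmers))

print("byte tokens recover reference:", recovered == ref_bytes)
print("k-mers changed:", changed_kmers, "of", len(ref_kmers))
```

At the byte level the insertion is one extra token in an otherwise unchanged sequence, while the same edit shifts the reading frame of every later k-mer, which is the loss of precision the summary refers to.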
Abstract
This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations…
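The abstract mentions pre-training with Masked Language Modeling on reference genome sequences. The sketch below shows one simple way such a training pair could be built at the byte level. It is an illustration under stated assumptions, not the paper's procedure: the mask token id `MASK_ID = 256`, the 15% mask rate, per-position (rather than span) masking, and the helper name `mask_bytes` are all hypothetical.

```python
from __future__ import annotations

import random

MASK_ID = 256  # hypothetical id placed just outside the 0-255 byte range


def mask_bytes(
    seq: str, mask_rate: float = 0.15, seed: int = 0
) -> tuple[list[int], dict[int, int]]:
    """Return (masked input tokens, {position: original byte}) for one sequence."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    tokens = list(seq.encode("ascii"))
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))

    inputs = tokens.copy()
    targets: dict[int, int] = {}
    for p in positions:
        targets[p] = tokens[p]  # what the model should predict at position p
        inputs[p] = MASK_ID     # what the model actually sees at position p
    return inputs, targets


sequence = "ACGTACGTACGTACGTACGT"
inputs, targets = mask_bytes(sequence)

# Filling the masked positions back in reproduces the original byte sequence.
restored = [targets.get(i, tok) for i, tok in enumerate(inputs)]
print("masked positions:", sorted(targets))
print("restored matches original:", bytes(restored).decode("ascii") == sequence)
```

Because each token is a single base, every masked position asks the model to recover exactly one nucleotide from its context.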
Taxonomy
Topics: Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms · Genomics and Phylogenetic Studies
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Dense Connections · Adam · Layer Normalization · Label Smoothing · Linear Layer · Balanced Selection · Byte Pair Encoding
