Embed-Search-Align: DNA Sequence Alignment using Transformer Models
Pavan Holur, K. C. Enevoldsen, Shreyas Rajesh, Lajoyce Mboning, Thalia Georgiou, Louis-S. Bouchard, Matteo Pellegrini, Vwani Roychowdhury

TL;DR
This paper introduces Embed-Search-Align, a Transformer-based method for DNA sequence alignment that uses embeddings and a vector store to achieve high accuracy, rivaling traditional tools like Bowtie and BWA-Mem.
Contribution
It presents a novel reference-free DNA embedding model and a search framework that significantly improves sequence alignment performance using Transformer architectures.
Findings
Achieves 99% accuracy on human genome reads
Outperforms existing Transformer models in DNA alignment tasks
Rivals conventional alignment tools like Bowtie and BWA-Mem
Abstract
DNA sequence alignment involves assigning short DNA reads to the most probable locations on an extensive reference genome. This process is crucial for various genomic analyses, including variant calling, transcriptomics, and epigenomics. Conventional methods, refined over decades, tackle this challenge in 2 steps: genome indexing followed by efficient search to locate likely positions for given reads. Building on the success of Large Language Models in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have explored whether the same Transformer architecture can produce embeddings for DNA sequences. Such models have shown early promise in classifying short DNA sequences, such as detecting coding/non-coding regions, and enhancer, promoter sequences. However, performance at sequence classification tasks does not translate to sequence…
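The index-then-search pipeline described above, combined with the embedding and vector-store idea from the TL;DR, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' DNA-ESA model: the Transformer encoder is replaced by a toy k-mer count embedding, the vector store is a plain Python list searched by brute-force cosine similarity, and the final alignment step is a simple Hamming-distance scan. All names (`embed`, `build_index`, `search`, `refine`) and parameter values are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import product

K = 5  # k-mer size for the toy embedding (assumption, not from the paper)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def embed(seq):
    """Toy stand-in for a learned encoder: L2-normalised k-mer count vector."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    vec = [float(counts.get(k, 0)) for k in KMERS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(reference, window, stride):
    """Embed: store one vector per overlapping reference fragment."""
    starts = range(0, len(reference) - window + 1, stride)
    return [(s, embed(reference[s:s + window])) for s in starts]

def search(index, read, top_k=3):
    """Search: rank fragments by cosine similarity to the read embedding."""
    q = embed(read)
    scored = [(sum(a * b for a, b in zip(q, v)), s) for s, v in index]
    return [s for _, s in sorted(scored, reverse=True)[:top_k]]

def refine(reference, read, starts, window):
    """Align: pick the offset with the fewest mismatches inside the candidates."""
    best = (len(read) + 1, -1)  # (mismatches, offset)
    for s in starts:
        end = min(s + window, len(reference)) - len(read) + 1
        for off in range(s, end):
            d = sum(a != b for a, b in zip(read, reference[off:off + len(read)]))
            best = min(best, (d, off))
    return best[1], best[0]

# Synthetic reference and an error-free read taken from a known position.
random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(2000))
WINDOW, STRIDE = 200, 100
index = build_index(reference, WINDOW, STRIDE)

true_pos = 730
read = reference[true_pos:true_pos + 80]
candidates = search(index, read, top_k=3)
position, mismatches = refine(reference, read, candidates, WINDOW)
print(position, mismatches)
```

The design point this is meant to illustrate is the split of work: the embedding search only narrows the genome down to a handful of candidate fragments, and an exact comparison then runs on those fragments alone. A real system would swap in a trained encoder and an approximate nearest-neighbour index in place of the brute-force list.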
Peer Reviews
Decision · Submitted to ICLR 2024
+ The authors present a novel approach to align sequence reads which can provide new possibilities for DNA sequence representation and search.
+ The proposed DNA-ESA encoder learns effective sequence embeddings for alignment, and outperforms several baseline transformer models designed for specific genomics tasks.
+ The approach is promising and demonstrates ability to generalize to new sequences not seen during training, like different chromosomes and even new species. Furthermore, formulating
- I felt that the paper is a very dense read for a general ML audience at ICLR, especially for readers without a DNA sequencing background; it would be great to make the paper more accessible.
- The embedding approach currently shows promising results on simulated data, but needs more evaluation on real sequencing data.
- The performance for short reads is worse than for long reads; since short reads are more commonly used, this may limit how the system can be used in practice.
- Limited demonstrat
The problem being tackled is an important one, and the proposed architecture does appear to have some benefits over existing transformer models. The ability to handle varying fragment lengths makes the method very flexible.
The experimental validation is the weak point of this paper. There are claims of efficiency, yet no experimental evidence of resource requirements or scaling, such as time and memory. Furthermore, the comparative experiments are against existing transformer models and hence show that the embedding is superior to existing ones for alignment, but not that this is a good aligner overall. There are no comparisons against standard alignment algorithms (though they are referenced). There are no experim
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam · Linear Layer · Dropout
