Embed-Search-Align: DNA Sequence Alignment using Transformer Models

Pavan Holur; K. C. Enevoldsen; Shreyas Rajesh; Lajoyce Mboning; Thalia Georgiou; Louis-S. Bouchard; Matteo Pellegrini; Vwani Roychowdhury

arXiv:2309.11087·q-bio.GN·December 6, 2024

TL;DR

This paper introduces Embed-Search-Align, a Transformer-based method for DNA sequence alignment that uses embeddings and a vector store to achieve high accuracy, rivaling traditional tools like Bowtie and BWA-Mem.
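The embed, store, and search flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `toy_embed` is a hypothetical stand-in that uses normalised k-mer counts where the paper uses a trained Transformer encoder, and the "vector store" is a plain Python list searched by brute force. The names `K`, `toy_embed`, `build_store`, and `search`, and the fragment length and stride, are all assumptions made for this example.

```python
import math
from itertools import product

K = 3  # k-mer size for the toy embedding (an assumption for this sketch)
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def toy_embed(seq):
    """Hypothetical stand-in encoder: L2-normalised k-mer count vector."""
    vec = [0.0] * len(KMER_INDEX)
    for i in range(len(seq) - K + 1):
        j = KMER_INDEX.get(seq[i:i + K])
        if j is not None:  # skip k-mers containing non-ACGT symbols
            vec[j] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_store(reference, frag_len, stride):
    """Embed overlapping reference fragments; keep (start, vector) pairs."""
    return [
        (start, toy_embed(reference[start:start + frag_len]))
        for start in range(0, len(reference) - frag_len + 1, stride)
    ]


def search(read, store, top_k=1):
    """Return the top_k (start, cosine similarity) pairs for a read."""
    query = toy_embed(read)
    scored = [
        (sum(a * b for a, b in zip(query, vec)), start) for start, vec in store
    ]
    scored.sort(reverse=True)
    return [(start, score) for score, start in scored[:top_k]]


reference = "ACGTTGCAAGGCTTACCGATGCATTGACCTAG"
store = build_store(reference, frag_len=12, stride=4)
hits = search(reference[8:20], store, top_k=2)
```

In this toy setup the best hit only narrows a read down to a candidate fragment; a real aligner would still need a fine-grained alignment step within that fragment to fix the exact position.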

Contribution

It presents a novel reference-free DNA embedding model and a search framework that significantly improves sequence alignment performance using Transformer architectures.

Findings

01

Achieves 99% accuracy on human genome reads

02

Outperforms existing Transformer models in DNA alignment tasks

03

Rivals conventional alignment tools like Bowtie and BWA-Mem

Abstract

DNA sequence alignment involves assigning short DNA reads to the most probable locations on an extensive reference genome. This process is crucial for various genomic analyses, including variant calling, transcriptomics, and epigenomics. Conventional methods, refined over decades, tackle this challenge in two steps: genome indexing, followed by an efficient search to locate likely positions for given reads. Building on the success of Large Language Models in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have explored whether the same Transformer architecture can produce embeddings for DNA sequences. Such models have shown early promise in classifying short DNA sequences, such as detecting coding/non-coding regions and enhancer or promoter sequences. However, performance at sequence classification tasks does not translate to sequence…
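The conventional "index, then search" pattern mentioned in the abstract can be illustrated with a small sketch. This is a generic k-mer hash index with seed voting, chosen here only for brevity; it is not how Bowtie or BWA-Mem are implemented, and the function names `build_index` and `locate` and the choice of `k` are assumptions for the example.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Step 1 (indexing): map every k-mer to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def locate(read, index, k):
    """Step 2 (search): each matching k-mer votes for an implied read start.

    Returns (start, votes) for the most-voted start, or None if no k-mer hits.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return votes.most_common(1)[0] if votes else None


reference = "ACGTTGCAAGGCTTACCGATGCATTGACCTAG"
index = build_index(reference, k=4)
exact_hit = locate(reference[10:22], index, k=4)
# A read with one substituted base still collects votes from its intact k-mers.
noisy_hit = locate("GCTTAGCGATGC", index, k=4)
```

Voting over seeds is what lets this kind of search tolerate a few mismatches: a single substitution removes only the k-mers that overlap it, and the remaining seeds still agree on the same start.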

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

+ The authors present a novel approach to align sequence reads which can provide new possibilities for DNA sequence representation and search.
+ The proposed DNA-ESA encoder learns effective sequence embeddings for alignment, and outperforms several baseline transformer models designed for specific genomics tasks.
+ The approach is promising and demonstrates ability to generalize to new sequences not seen during training, like different chromosomes and even new species. Furthermore, formulating

Weaknesses

- I felt that the paper is a very dense read for the general ML audience at ICLR, especially readers without a DNA sequencing background; it would be great to make the paper more accessible.
- The embedding approach currently shows promising results on simulated data, but needs more evaluation on real sequencing data.
- Performance on short reads is worse than on long reads; because short reads are more commonly used, this may limit how the system can actually be used.
- Limited demonstrat

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

NA

Weaknesses

NA

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

The problem being tackled is an important one, and the proposed architecture does appear to have some benefits over existing transformer models. The ability to handle varying fragment lengths makes the method very flexible.

Weaknesses

The experimental validation is the weak point of this paper. There are claims of efficiency, yet no experimental evidence of resource requirements or scaling, such as time and memory. Furthermore, the comparative experiments are against existing transformer models and hence show that the embedding is superior to existing ones for alignment, but not that this is a good aligner overall. There are no comparisons against standard alignment algorithms (though they are referenced). There are no experim

Code & Models

Models

🤗 roychowdhuryresearch/dna2vec · model · 48 downloads · ♡ 2

Taxonomy

Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam · Linear Layer · Dropout