Unaligned Sequence Similarity Search Using Deep Learning
James K. Senter, Taylor M. Royalty, Andrew D. Steen, Amir Sadovnik

TL;DR
This paper introduces a deep learning-based embedding method for DNA and amino-acid sequences that enables faster similarity searches, better handling of unknown genes, and improved clustering and classification compared to traditional alignment-based methods.
Contribution
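The authors propose a novel approach that uses recurrent neural networks to embed DNA or amino-acid sequences in a low-dimensional space where distances correlate with functional similarity. This addresses two limitations of alignment-based methods: they give little useful information when a gene has no close match in the database, and each comparison is costly when the database is large.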
Findings
Embedding space enables fast similarity search using Euclidean distance.
Method improves clustering of unknown gene sequences.
Supports classification with labeled data and clustering of unlabeled data.
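The findings above can be illustrated with a minimal sketch. It assumes the sequences have already been embedded as fixed-length vectors; the 2-D toy vectors, the labels, and the helper names `nearest_neighbors` and `knn_label` are hypothetical and are not taken from the paper. The sketch shows a brute-force Euclidean nearest-neighbor search, followed by a simple majority vote over the neighbors' labels as one possible way to classify a query when the database is labeled.

```python
import numpy as np
from collections import Counter

def nearest_neighbors(query, database, k=3):
    """Return indices and distances of the k database embeddings closest to query."""
    # Euclidean distance from the query to every row of the database.
    dists = np.linalg.norm(database - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

def knn_label(query, database, labels, k=3):
    """Predict a label by majority vote among the k nearest embeddings."""
    idx, _ = nearest_neighbors(query, database, k)
    votes = Counter(labels[i] for i in idx)
    return votes.most_common(1)[0][0]

# Made-up 2-D embeddings standing in for the output of an embedding model.
database = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # hypothetical "kinase" cluster
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],   # hypothetical "transporter" cluster
])
labels = ["kinase"] * 3 + ["transporter"] * 3

query = np.array([4.8, 5.0])              # embedding of an unannotated gene
idx, dists = nearest_neighbors(query, database, k=3)
predicted = knn_label(query, database, labels, k=3)
print(idx, dists, predicted)
```

No alignment is performed at query time: each comparison is a single vector distance. For a large database, the brute-force scan could be swapped for a spatial index or an approximate nearest-neighbor structure; that is a design choice beyond what the summary states.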
Abstract
Gene annotation has traditionally required direct comparison of DNA sequences between an unknown gene and a database of known ones using string comparison methods. However, these methods do not provide useful information when a gene does not have a close match in the database. In addition, each comparison can be costly when the database is large since it requires alignments and a series of string comparisons. In this work we propose a novel approach: using recurrent neural networks to embed DNA or amino-acid sequences in a low-dimensional space in which distances correlate with functional similarity. This embedding space overcomes both shortcomings of the method of aligning sequences and comparing homology. First, it allows us to obtain information about genes which do not have exact matches by measuring their similarity to other ones in the database. If our database is labeled this can…
