Distributed Representations for Biological Sequence Analysis
Dhananjay Kimothi, Akshay Soni, Pravesh Biyani, James M. Hogan

TL;DR
This paper introduces seq2vec, a method for embedding complete biological sequences in a Euclidean space. The embedding aims to capture contextual information useful for sequence comparison, and the paper reports encouraging results on protein classification and retrieval tasks.
Contribution
The paper presents seq2vec, an embedding technique for biological sequences inspired by text document embedding models, intended to make sequence data easier to use with machine learning tools.
Findings
Encouraging results in protein sequence classification
Effective retrieval performance demonstrated
High-quality embeddings capturing sequence context
Abstract
Biological sequence comparison is a key step in inferring the relatedness of various organisms and the functional similarity of their components. Thanks to the Next Generation Sequencing efforts, an abundance of sequence data is now available to be processed for a range of bioinformatics applications. Embedding a biological sequence over a nucleotide or amino acid alphabet in a lower-dimensional vector space makes the data more amenable for use by current machine learning tools, provided the quality of embedding is high and it captures the most meaningful information of the original sequences. Motivated by recent advances in the text document embedding literature, we present a new method, called seq2vec, to represent a complete biological sequence in a Euclidean space. The new representation has the potential to capture the contextual information of the original sequence necessary for…
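To make the text analogy concrete, the sketch below treats a sequence as a "document" whose "words" are overlapping k-mers, then maps it to a fixed-length vector. This is only an illustration under stated assumptions: the summary above does not say how seq2vec tokenizes sequences, so the k-mer tokenization, the function names, and the parameters `k` and `dim` are all hypothetical. The vector here is a hashed bag-of-k-mers count, which is not seq2vec: it ignores order and context, and stands in only to show what "a sequence as a point in Euclidean space" looks like.

```python
import hashlib
import math


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers, treated as 'words' (assumed tokenization)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hashed_kmer_vector(seq, k=3, dim=64):
    """Map a sequence to a unit-length vector of hashed k-mer counts.

    Stand-in for a learned embedding: it ignores k-mer order and context.
    """
    vec = [0.0] * dim
    for token in kmer_tokens(seq, k):
        # Use a stable hash so the vector is reproducible across runs.
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(u, v):
    """Cosine similarity for vectors that are already unit length (or all zero)."""
    return sum(a * b for a, b in zip(u, v))


# Made-up example sequences, not taken from the paper.
a = hashed_kmer_vector("MKTAYIAKQRQISFVKSHFSRQ")
b = hashed_kmer_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(cosine(a, b))
```

A learned document embedding model would replace `hashed_kmer_vector` while keeping the same interface: tokens in, fixed-length vector out, with similarity then computed in the vector space.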
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Biomedical Text Mining and Ontologies
