Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis
Nabil Ibtehaz, S. M. Shakhawat Hossain Sourav, Md. Shamsuzzoha Bayzid, M. Sohel Rahman

TL;DR
This paper introduces Align-gram, a novel $k$-mer embedding scheme for protein sequences that improves deep learning model performance by capturing biological similarities more effectively than traditional methods.
Contribution
The study proposes Align-gram, a new embedding scheme that incorporates biological insights into the Skip-gram model for enhanced protein sequence analysis.
Findings
Align-gram maps similar $k$-mers close in vector space.
Align-gram embeddings improve deep learning model training.
Demonstrated effectiveness on LSTM and CNN models.
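To make "close in vector space" concrete, here is a minimal sketch of how closeness between $k$-mer embeddings is commonly measured with cosine similarity. The three-dimensional vectors and the helper names `cosine_similarity` and `nearest` are invented for illustration; they are not embeddings or code from the paper, and the paper may use a different distance measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumption): made-up numbers, not learned values.
toy_embeddings = {
    "ALV": [0.9, 0.1, 0.2],
    "ALI": [0.8, 0.2, 0.2],
    "WWC": [-0.1, 0.9, -0.4],
}

def nearest(kmer, embeddings):
    """Return the other k-mer whose vector has the highest cosine similarity."""
    query = embeddings[kmer]
    candidates = [other for other in embeddings if other != kmer]
    return max(candidates, key=lambda other: cosine_similarity(query, embeddings[other]))

print(nearest("ALV", toy_embeddings))
```

In an embedding with the property described above, a query $k$-mer's nearest neighbours under such a measure would be biologically similar $k$-mers.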
Abstract
Background: The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Motivation: Owing to the rapid development of deep learning, recent years have seen a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used for various biological applications. In this study, we investigated the applicability of the popular Skip-gram model to protein sequence analysis and attempted to incorporate some biological insights into it. Results: We propose a novel $k$-mer embedding scheme, Align-gram, which is capable of mapping the…
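As background for the Skip-gram model mentioned in the abstract, the sketch below shows how a protein sequence might be split into $k$-mers and turned into (center, context) training pairs in the standard Skip-gram setup. It assumes overlapping $k$-mers with stride 1 and a symmetric context window; the function names `kmers` and `skipgram_pairs` and the example sequence are hypothetical, and this illustrates plain Skip-gram rather than the Align-gram modification.

```python
def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs from tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical short amino-acid string, used only to show the mechanics.
tokens = kmers("MKVLAT", k=3)
for center, context in skipgram_pairs(tokens, window=1):
    print(center, context)
```

A standard Skip-gram model is then trained to predict each context token from its center token, so the learned vectors reflect co-occurrence in the sequence.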
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Topic Modeling
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
