Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis

Nabil Ibtehaz; S. M. Shakhawat Hossain Sourav; Md. Shamsuzzoha Bayzid; M. Sohel Rahman

arXiv:2012.03324 · q-bio.QM · December 8, 2020

TL;DR

This paper introduces Align-gram, a novel $k$-mer embedding scheme for protein sequences that improves deep learning model performance by capturing biological similarities more effectively than traditional methods.

Contribution

The study proposes Align-gram, a new embedding scheme that incorporates biological insights into the Skip-gram model for enhanced protein sequence analysis.
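For readers unfamiliar with the baseline being rethought, the following is a minimal sketch of how a standard Skip-gram model is typically fed with protein $k$-mers. It is not the Align-gram method itself, whose details are not given on this page. The function names `kmers` and `skipgram_pairs`, the choice of overlapping $k$-mers, the window size, and the example sequence are all assumptions made for illustration.

```python
from typing import List, Tuple


def kmers(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center, context) training pairs as in the standard Skip-gram setup."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sequence, used only to show the shapes involved.
tokens = kmers("MKTAYIAK", k=3)
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs[:3])
```

In plain Skip-gram, the pairs come purely from co-occurrence within the window; according to the summary above, Align-gram's change is to bring biological insight into this scheme.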

Findings

01 Align-gram maps similar $k$-mers close to each other in the vector space.

02 Align-gram embeddings improve deep learning model training.

03 Its effectiveness is demonstrated on LSTM and CNN models.

Abstract

Background: The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often quoted as the `language of life', have been analyzed for a multitude of applications and inferences. Motivation: Owing to the rapid development of deep learning, there have been a number of breakthroughs in Natural Language Processing in recent years. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are used for various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and attempted to incorporate some biological insights into it. Results: We propose a novel $k$-mer embedding scheme, Align-gram, which is capable of mapping the…

Code & Models

Repositories

nibtehaz/align-gram · tf · Official

Taxonomy

Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Topic Modeling

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory