Macromolecule Classification Based on the Amino-acid Sequence

Faisal Ghaffar; Sarwar Khan; Gaddisa O.; Chen Yu-jhen

arXiv:2001.01717 · q-bio.BM · September 23, 2022 · 1 cites

TL;DR

This paper applies deep learning models to classify amino acid sequences into DNA, RNA, protein, or hybrid classes, achieving nearly 99% accuracy using various neural network architectures.

Contribution

It introduces the use of NLP-inspired word embedding techniques for protein sequence classification with deep learning models.

Findings

01

Achieved almost 99% train and test accuracy in classifying sequences

02

Compared CNN, LSTM, BiLSTM, and GRU architectures

03

Demonstrated effectiveness of NLP techniques in bioinformatics

Abstract

Deep learning plays a vital role in every field that involves data. It has emerged as a strong and efficient framework that can be applied to a broad spectrum of complex learning problems that were difficult to solve with traditional machine learning techniques. In this study, we focus on classifying protein sequences with deep learning techniques. The study of amino acid sequences is vital in the life sciences. We used different word embedding techniques from natural language processing to represent amino acid sequences as vectors. Our main goal was to classify sequences into four classes: DNA, RNA, protein, and hybrid. After several tests, we achieved almost 99% train and test accuracy. We experimented with CNN, LSTM, bidirectional LSTM, and GRU models.
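
The abstract does not say how sequences are split into "words" before embedding. As a minimal sketch, assuming overlapping k-mers (a common NLP-style choice, not confirmed as the authors' method), a sequence can be tokenized and mapped to padded integer ids that an embedding layer could then consume. The function names, the value k=3, and the toy sequences below are all hypothetical.

```python
from collections import Counter


def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers, treated as 'words'."""
    sequence = sequence.strip().upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(sequences, k=3):
    """Map each k-mer to an integer id; 0 is padding, 1 is unknown."""
    counts = Counter(tok for s in sequences for tok in kmer_tokens(s, k))
    vocab = {"<pad>": 0, "<unk>": 1}
    for token, _ in counts.most_common():
        vocab[token] = len(vocab)
    return vocab


def encode(sequence, vocab, max_len, k=3):
    """Convert a sequence to a fixed-length list of k-mer ids."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in kmer_tokens(sequence, k)]
    ids = ids[:max_len]  # truncate long sequences
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short ones


# Toy sequences for illustration only (not from the paper's dataset).
toy_sequences = ["MKTAYIAKQR", "MKTAYLLKQR"]
vocab = build_vocab(toy_sequences)
encoded = [encode(s, vocab, max_len=10) for s in toy_sequences]
```

Each row of `encoded` has the same length, so the rows can be stacked into a batch and passed to an embedding layer in front of a CNN, LSTM, bidirectional LSTM, or GRU classifier.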

Taxonomy

Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Fractal and DNA sequence analysis

Methods: Test · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Gated Recurrent Unit