TL;DR
This paper introduces ProtVec, a neural-network-based continuous vector representation of protein sequences, intended for bioinformatics tasks such as family classification, structure prediction, and disordered protein identification.
Contribution
It presents a neural network approach that represents each protein sequence as a single dense n-dimensional vector, which can then serve as input features for tasks such as family classification and disordered protein identification.
Findings
Achieved 93% average accuracy in protein family classification on 324,018 Swiss-Prot sequences.
Distinguished disordered from structured proteins with up to 100% accuracy.
Outperformed existing family classification methods.
Abstract
We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot…
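The idea of mapping a whole protein sequence to a single dense vector can be illustrated with a minimal sketch. It rests on assumptions that the text above does not state: that the sequence is tokenized into 3-mers read at three shifted offsets, and that the per-3-mer vectors are summed to form the sequence vector. The names `shifted_kmers`, `toy_embedding`, `sequence_vector`, and the dimensionality `DIM` are hypothetical, and `toy_embedding` is only a hash-based stand-in for vectors that a trained neural network would supply.

```python
import hashlib
from typing import List

DIM = 8  # toy dimensionality (hypothetical), kept small for readability


def shifted_kmers(seq: str, k: int = 3) -> List[List[str]]:
    """Split seq into k lists of non-overlapping k-mers, one list per start offset."""
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


def toy_embedding(kmer: str, dim: int = DIM) -> List[float]:
    """Stand-in for a trained embedding: a deterministic pseudo-vector from a hash.

    Assumption: a real pipeline would look the k-mer up in a learned table.
    """
    digest = hashlib.sha256(kmer.encode("ascii")).digest()
    return [b / 255.0 - 0.5 for b in digest[:dim]]


def sequence_vector(seq: str, k: int = 3, dim: int = DIM) -> List[float]:
    """Sum the vectors of all k-mers in seq into one dense dim-dimensional vector."""
    total = [0.0] * dim
    for kmers in shifted_kmers(seq, k):
        for kmer in kmers:
            for j, value in enumerate(toy_embedding(kmer, dim)):
                total[j] += value
    return total


if __name__ == "__main__":
    print(shifted_kmers("MKVLAAG"))
    print(sequence_vector("MKVLAAG"))
```

The resulting fixed-length vector has the same dimensionality for every sequence regardless of its length, which is what lets a standard classifier consume it directly.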
