PEvoLM: Protein Sequence Evolutionary Information Language Model
Issar Arab

TL;DR
PEvoLM introduces a novel protein language model that efficiently captures evolutionary information from protein sequences by combining NLP techniques with multi-task learning, reducing parameters and improving biological relevance.
Contribution
This work presents a new bidirectional language model for proteins that integrates PSSM-based evolutionary data with transfer learning, using fewer parameters than traditional models.
Findings
Model learns evolutionary features effectively.
Uses four times fewer free parameters than the original ELMo.
Open-source implementation available for community use.
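The multi-task idea mentioned in the TL;DR (predicting the next amino acid while also matching a PSSM-derived distribution) can be illustrated with a small sketch. It is not the author's implementation. The two separate output heads, the KL-divergence term, the `weight` parameter, and all function names are assumptions made for illustration; the paper may combine the tasks differently.

```python
import math

# The 20 standard amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(probs, target_index):
    """Cross-entropy for the single observed next amino acid."""
    return -math.log(probs[target_index])


def profile_loss(probs, target_profile):
    """KL divergence from a PSSM-derived target distribution to the prediction.

    Zero-probability target entries contribute nothing to the sum.
    """
    return sum(t * math.log(t / p) for t, p in zip(target_profile, probs) if t > 0)


def multi_task_loss(token_logits, profile_logits, target_index, target_profile, weight=0.5):
    """Weighted sum of both objectives; `weight` is a hypothetical hyperparameter."""
    token_probs = softmax(token_logits)      # assumed head 1: next amino acid
    profile_probs = softmax(profile_logits)  # assumed head 2: PSSM-like distribution
    return ((1.0 - weight) * next_token_loss(token_probs, target_index)
            + weight * profile_loss(profile_probs, target_profile))


if __name__ == "__main__":
    n = len(AMINO_ACIDS)
    uniform = [1.0 / n] * n
    # With uninformative logits, only the next-token term contributes.
    print(multi_task_loss([0.0] * n, [0.0] * n, 3, uniform))
```

In this sketch, training against both terms at once is what would let a single network pick up sequence context and evolutionary signal together.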
Abstract
With the exponential increase of protein sequence databases over time, multiple-sequence alignment (MSA) methods, like PSI-BLAST, perform exhaustive and time-consuming database searches to retrieve evolutionary information. The resulting position-specific scoring matrices (PSSMs) of such search engines represent a crucial input to many machine learning (ML) models in the field of bioinformatics and computational biology. A protein sequence is a collection of contiguous tokens or characters called amino acids (AAs). The analogy to natural language allowed us to exploit the recent advancements in the field of Natural Language Processing (NLP) and therefore transfer NLP state-of-the-art algorithms to bioinformatics. This research presents an Embedding Language Model (ELMo), converting a protein sequence to a numerical vector representation. While the original ELMo trained a 2-layer…
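The analogy between amino acids and language tokens can be made concrete with a minimal tokenization sketch. The alphabet ordering, the extra "unknown" index, and the function names are assumptions for illustration only; they are not taken from the PEvoLM code.

```python
# The 20 standard amino acids as one-letter codes; the ordering is arbitrary here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN_INDEX = len(AMINO_ACIDS)  # hypothetical catch-all for non-standard letters


def tokenize(sequence):
    """Map each residue letter to an integer token, as a text tokenizer would."""
    return [AA_TO_INDEX.get(aa, UNKNOWN_INDEX) for aa in sequence.upper()]


def one_hot(sequence):
    """Encode a sequence as a list of one-hot vectors of length 21."""
    size = len(AMINO_ACIDS) + 1
    vectors = []
    for token in tokenize(sequence):
        row = [0] * size
        row[token] = 1
        vectors.append(row)
    return vectors


if __name__ == "__main__":
    print(tokenize("MKTX"))
```

A language model would consume these integer tokens (or learned embeddings of them) one position at a time, in the same way it consumes words or characters of text.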
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Softmax · Bidirectional LSTM · ELMo
