Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
Seonwoo Min, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, Byunghan Lee, Sungroh Yoon

TL;DR
This paper introduces PLUS, a novel pre-training scheme for protein sequences that incorporates structural information and improves performance on multiple biological tasks.
Contribution
The paper proposes PLUS, a pre-training method that combines masked language modeling with same-family prediction, a complementary protein-specific task that brings structural information into protein representation learning.
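To make the two pre-training signals concrete, here is a minimal sketch of how one training example might be assembled: residues are randomly masked for the language-modeling objective, and a pair of sequences carries a binary label for same-family prediction (the task named in the abstract). This is an illustration under stated assumptions, not the authors' implementation. The function names, the `#` mask symbol, the 15% masking rate, and the string family labels are all hypothetical.

```python
import random

MASK = "#"  # hypothetical mask symbol, not taken from the paper


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Randomly replace residues with MASK.

    Returns the masked sequence and a dict {position: original residue}
    that a masked language model would be trained to recover.
    The 15% rate is an assumption for illustration.
    """
    if rng is None:
        rng = random.Random()
    masked = list(seq)
    targets = {}
    for i, residue in enumerate(seq):
        if rng.random() < mask_prob:
            targets[i] = residue
            masked[i] = MASK
    return "".join(masked), targets


def make_pair_example(seq_a, family_a, seq_b, family_b, mask_prob=0.15, rng=None):
    """Build one example carrying both training signals.

    - mlm_targets: masked positions to predict in each sequence
    - same_family: 1 if both sequences share a family label, else 0
    """
    if rng is None:
        rng = random.Random()
    masked_a, targets_a = mask_sequence(seq_a, mask_prob, rng)
    masked_b, targets_b = mask_sequence(seq_b, mask_prob, rng)
    return {
        "inputs": (masked_a, masked_b),
        "mlm_targets": (targets_a, targets_b),
        "same_family": int(family_a == family_b),
    }


if __name__ == "__main__":
    example = make_pair_example(
        "MKTAYIAKQR", "family_1", "MKTAYIAKQL", "family_1",
        rng=random.Random(0),
    )
    print(example)
```

In a full training loop, a model would be optimized on both targets jointly; how the two losses are weighted is not stated in this summary.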
Findings
PLUS-RNN outperforms similar models on six of seven tasks
PLUS effectively exploits evolutionary relationships among proteins
The method is broadly applicable across protein biology tasks
Abstract
Bridging the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies adopted semi-supervised learning for protein sequence modeling. In these studies, models were pre-trained with a substantial amount of unlabeled data, and the representations were transferred to various downstream tasks. Most pre-training methods solely rely on language modeling and often exhibit limited performance. In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary protein-specific pre-training task, namely same-family prediction. PLUS can be used to pre-train various model architectures. In this work, we use PLUS to pre-train a bidirectional recurrent neural network and refer to the resulting model as…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms
