Protein Representation Learning by Geometric Structure Pretraining
Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang

TL;DR
This paper introduces a pretraining approach for protein representations based on 3D structural information, combining a geometric structure encoder with contrastive learning to improve protein function prediction and fold classification while using less pretraining data.
Contribution
It proposes a geometric structure pretraining method for proteins that matches or outperforms sequence-based models on function prediction and fold classification while using less pretraining data.
Findings
Outperforms or matches state-of-the-art sequence-based methods.
Requires significantly less pretraining data.
Effective in both function prediction and fold classification.
Abstract
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different…
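As a rough illustration of the contrastive pretraining idea mentioned in the abstract, the sketch below computes an InfoNCE-style loss between two batches of embeddings. It is a minimal sketch under stated assumptions, not the authors' implementation: this summary does not specify the loss, the way views are generated, or the encoder, so InfoNCE, the function name info_nce_loss, and the temperature value are all assumptions chosen for illustration. Row i of each batch stands for an embedding of a different view of the same protein.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(view_a, view_b, temperature=0.07):
    """InfoNCE-style loss over two equally sized lists of embeddings.

    view_a[i] and view_b[i] are treated as a positive pair (two views of
    the same protein); every other row of view_b is a negative for view_a[i].
    The name and default temperature are hypothetical, not from the paper.
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of a_i to every embedding in the other view.
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the true positive (index i).
        total += log_denom - logits[i]
    return total / len(a)


# Two toy "proteins" with orthogonal embeddings.
views = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(views, views)
mismatched = info_nce_loss(views, [views[1], views[0]])
```

In this toy case the aligned pairing yields a much lower loss than the mismatched one, which is the signal a contrastive objective uses to pull views of the same protein together and push different proteins apart.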
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Bioinformatics and Genomic Networks
Methods: Contrastive Learning
