Protein Representation Learning by Geometric Structure Pretraining

Zuobai Zhang; Minghao Xu; Arian Jamasb; Vijil Chenthamarakshan; Aurelie Lozano; Payel Das; Jian Tang

arXiv:2203.06125 · cs.LG · January 31, 2023 · 40 cites

TL;DR

This paper introduces a pretraining approach for protein representations based on 3D structural information. It combines a geometric structure encoder with contrastive learning to improve protein function prediction and fold classification while using less pretraining data.

Contribution

It proposes a geometric structure pretraining method for proteins that matches or outperforms sequence-based models on function prediction and fold classification while using less pretraining data.

Findings

01

Outperforms or matches state-of-the-art sequence-based methods.

02

Requires significantly less pretraining data than sequence-based methods.

03

Is effective for both function prediction and fold classification.

Abstract

Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different…
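The abstract names multiview contrastive learning but does not state the loss. A minimal sketch follows, assuming an InfoNCE-style objective, which is a common choice for contrastive pretraining and is not confirmed by the text above. The function names `cosine` and `info_nce`, the temperature value, and the toy embeddings are all hypothetical; in practice the embeddings would come from the protein graph encoder applied to two views of the same protein.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.07):
    """Mean InfoNCE loss over a batch (assumed objective, not from the source).

    view_a[i] and view_b[i] are embeddings of two views of protein i
    (the positive pair); view_b[j] for j != i serve as negatives.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the positive pair.
        total += log_denominator - logits[i]
    return total / n


# Toy batch: two proteins, 2-dimensional embeddings (illustrative values only).
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(aligned, aligned)
loss_mismatched = info_nce(aligned, swapped)
```

Here `loss_matched` is close to zero because each embedding is most similar to its own second view, while `loss_mismatched` is large because each positive pair is orthogonal. Minimizing this loss pulls views of the same protein together and pushes views of different proteins apart.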

Code & Models

Datasets

Oxer11/Protein-Function-Annotation
dataset · 22 dl

Videos

Protein Representation Learning by Geometric Structure Pretraining · SlidesLive

Taxonomy

Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Bioinformatics and Genomic Networks

Methods: Contrastive Learning