Evaluating representation learning on the protein structure universe

Arian R. Jamasb; Alex Morehead; Chaitanya K. Joshi; Zuobai Zhang; Kieran Didi; Simon V. Mathis; Charles Harris; Jian Tang; Jianlin Cheng; Pietro Lio; Tom L. Blundell

arXiv:2406.13864·cs.LG·June 21, 2024·6 cites

TL;DR

This paper introduces ProteinWorkshop, a benchmark suite for evaluating protein structure representation learning with GNNs, demonstrating the benefits of large-scale pretraining and the advantages of equivariant models.

Contribution

It provides a comprehensive benchmark and open-source tools for systematic evaluation of protein structure representations using GNNs.

Findings

1. Pretraining on AlphaFold structures improves GNN performance.
2. Equivariant GNNs benefit more from pretraining than invariant models.
3. The benchmark facilitates fair comparison and progress in the field.

Abstract

We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pretraining and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representations and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent than invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our…
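The abstract distinguishes rotation-invariant from equivariant models. As a minimal sketch of what those two properties mean for features computed from 3D coordinates, the toy example below checks both numerically. The helper names (`rotate_z`, `pairwise_distances`, `centroid_offsets`) and the coordinates are illustrative assumptions, not part of ProteinWorkshop or any model in the paper.

```python
import math

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def pairwise_distances(coords):
    """Scalar features: distances between all pairs of points (invariant)."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]

def centroid_offsets(coords):
    """Vector features: displacement of each point from the centroid (equivariant)."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    return [tuple(p[k] - centroid[k] for k in range(3)) for p in coords]

# Hypothetical coordinates standing in for a short backbone trace.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.1, 4.2, 1.5)]
theta = 0.7
rotated = [rotate_z(p, theta) for p in coords]

# Invariance: f(R x) == f(x). Distances do not change under rotation.
invariant_ok = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(coords), pairwise_distances(rotated))
)

# Equivariance: f(R x) == R f(x). Vector features rotate with the input.
equivariant_ok = all(
    math.isclose(u, v, abs_tol=1e-9)
    for f_rot, f in zip(centroid_offsets(rotated), centroid_offsets(coords))
    for u, v in zip(f_rot, rotate_z(f, theta))
)

print(invariant_ok, equivariant_ok)
```

Invariant features discard orientation entirely, while equivariant features carry directional information that transforms predictably with the input. That is the sense in which the abstract calls equivariant GNNs more expressive.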

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. Modular benchmark enabling rapid evaluation of protein representation learning methods across various tasks, models, representations, and pre-training setups.
2. Analysis of model performance across these different representations and architectures.
3. Using auxiliary tasks to improve the performance of both invariant and equivariant models.
4. Providing tools and procedures for training and evaluating models.

Weaknesses

1. The work is missing an explanation of the limitations of the featurization schemes and pre-training tasks.
2. It would be beneficial to include a discussion of how well the benchmark results generalize to the overall protein structure space, and how this translates to proteins not included in the current dataset.
3. Missing a discussion of how geometric models may be improved to surpass sequence-based models.
4. Missing information about the ease of use of the tools, and details ab

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

This paper is well-written and easy to follow. The provided datasets, GNN models, and training strategies are comprehensive.

Weaknesses

1. In addition to datasets and related resources, I think a good benchmark should also provide experimental results with well-searched hyperparameters. In that case, future researchers can take the results directly for a fair comparison.
   - However, in the current version, the authors didn't provide results on all downstream tasks.
   - In addition, I am not sure whether the hyperparameters are well-searched, since the best results reported here are still worse than some existing methods. For example, the best result

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

The framework, which assembles public datasets into pretraining and downstream benchmark datasets for studying protein structure representation, is a noteworthy development. The examination of how pretraining and featurization affect various downstream architectures and tasks is engaging and insightful. The timeliness and significance of this research topic cannot be overstated, as it addresses the pressing need for a standardized framew

Weaknesses

Addressing the Issue of Potential Leakage: Efforts to mitigate potential data leakage are crucial to ensuring the integrity of benchmarking results, as such leakage could introduce misleading elements into the research findings. Have you considered the removal of overlapping sequences between the pretraining datasets and the downstream testing datasets to further safeguard against such issues?

Expanding Featurization Methods: In terms of featurization, the paper seems to primarily focus on si
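Reviewer 03's leakage question can be illustrated with a minimal sketch that drops pretraining entries whose sequence exactly matches a downstream test sequence. This is not the authors' procedure: the function name `remove_test_overlap`, the identifiers, and the sequences are hypothetical. Exact matching is also the weakest form of filtering, because it misses near-duplicates. Filtering by sequence similarity would require a dedicated clustering or alignment tool, which is not shown here.

```python
def remove_test_overlap(pretrain, test_sequences):
    """Return the pretraining entries whose sequence is absent from the test set.

    pretrain: dict mapping an entry ID to its amino-acid sequence.
    test_sequences: iterable of sequences held out for downstream testing.
    Comparison is case-insensitive and exact-match only (an assumption of this sketch).
    """
    held_out = {seq.upper() for seq in test_sequences}
    return {pid: seq for pid, seq in pretrain.items() if seq.upper() not in held_out}

# Hypothetical toy data: entry "B" duplicates a test sequence and should be dropped.
pretrain = {"A": "MKTAYIAKQR", "B": "GSHMLEDPVD", "C": "MKVLAAGIVG"}
test_sequences = ["gshmledpvd", "MNNQRKKARN"]

filtered = remove_test_overlap(pretrain, test_sequences)
print(sorted(filtered))
```

Building the held-out set once keeps each lookup constant-time, so the filter scales linearly with the size of the pretraining corpus.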

Code & Models

Repositories

a-r-j/proteinworkshop · jax · Official

Taxonomy

Topics: Protein Structure and Dynamics · Genetics, Bioinformatics, and Biomedical Research · Machine Learning in Bioinformatics

Methods: AlphaFold