Evaluating representation learning on the protein structure universe
Arian R. Jamasb, Alex Morehead, Chaitanya K. Joshi, Zuobai Zhang, Kieran Didi, Simon V. Mathis, Charles Harris, Jian Tang, Jianlin Cheng, Pietro Lio, and Tom L. Blundell

TL;DR
This paper introduces ProteinWorkshop, a benchmark suite for evaluating protein structure representation learning with GNNs, demonstrating the benefits of large-scale pretraining and the advantages of equivariant models.
Contribution
It provides a comprehensive benchmark and open-source tools for systematic evaluation of protein structure representations using GNNs.
Findings
Pretraining on AlphaFold structures improves GNN performance.
Equivariant GNNs benefit more from pretraining than invariant models.
The benchmark facilitates fair comparison and progress in the field.
Abstract
We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pretraining and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representations and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent than invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our…
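The distinction between rotation-invariant and equivariant features that the abstract relies on can be illustrated with a toy example. The sketch below is not taken from ProteinWorkshop; the coordinates and the helper names (`rotate_z`, `pairwise_distances`, `displacement_vectors`) are hypothetical and chosen only for illustration. It checks that pairwise distances are unchanged when a structure is rotated (invariance), while displacement vectors rotate together with the structure (equivariance).

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point (or vector) about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def pairwise_distances(coords):
    """Invariant features: distances between all pairs of positions."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]

def displacement_vectors(coords):
    """Equivariant features: vectors between consecutive positions."""
    return [tuple(q - p for p, q in zip(a, b)) for a, b in zip(coords, coords[1:])]

# Toy backbone trace (hypothetical coordinates, not from any real protein).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (6.1, 4.0, 3.6)]
angle = math.pi / 3
rotated = [rotate_z(p, angle) for p in coords]

# Invariance: f(R x) == f(x)
invariant_ok = all(
    math.isclose(d0, d1, abs_tol=1e-9)
    for d0, d1 in zip(pairwise_distances(coords), pairwise_distances(rotated))
)

# Equivariance: f(R x) == R f(x)
equivariant_ok = all(
    math.isclose(u, v, abs_tol=1e-9)
    for vec_rot, vec in zip(displacement_vectors(rotated), displacement_vectors(coords))
    for u, v in zip(vec_rot, rotate_z(vec, angle))
)

print(invariant_ok, equivariant_ok)
```

An invariant model only ever sees quantities of the first kind, whereas an equivariant model can also propagate quantities of the second kind through its layers.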
Peer Reviews
Decision: ICLR 2024 poster
1. Modular benchmark enabling rapid evaluation of protein representation learning methods across various tasks, models, representations, and pre-training setups.
2. Analysis of model performance across these different representations and architectures.
3. Using auxiliary tasks to improve the performance of both invariant and equivariant models.
4. Providing tools and procedures for training and evaluating models.
1. The work is missing an explanation of the limitations of the featurization schemes and pre-training tasks.
2. It would be beneficial to include a discussion of how well the benchmark results generalize to the overall protein structure space, and how this translates to proteins not included in the current dataset.
3. Missing a discussion of how geometric models may be improved to surpass sequence-based models.
4. Missing information about the ease of use of the tools, and details ab
This paper is well-written and easy to follow. The provided datasets, GNN models, and training strategies are comprehensive.
1. In addition to datasets, I think a good benchmark should also provide experimental results with well-searched hyperparameters. In that case, future researchers can directly take the results for a fair comparison.
   - However, in the current version, the authors didn't provide results on all downstream tasks.
   - In addition, I am not sure whether the hyperparameters are well-searched, since the best results reported here are still worse than some existing methods. For example, the best result
The unveiling of this framework, designed for assembling public datasets in order to generate pretraining and downstream benchmark datasets for the study of protein structure representation, is a noteworthy development. The examination of how pretraining and featurization affect various downstream architectures and tasks proves to be engaging and insightful. The timeliness and significance of this research topic cannot be overstated, as it addresses the pressing need for a standardized framew
Addressing the Issue of Potential Leakage: Efforts to mitigate potential data leakage are crucial to ensuring the integrity of benchmarking results, as such leakage could introduce misleading elements into the research findings. Have you considered the removal of overlapping sequences between the pretraining datasets and the downstream testing datasets to further safeguard against such issues?
Expanding Featurization Methods: In terms of featurization, the paper seems to primarily focus on si
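The leakage concern raised in this review can be made concrete with a minimal sketch. It assumes a hypothetical setup in which the pretraining set is a dictionary from identifiers to amino-acid sequences and the downstream test set is a list of sequences; the function name `remove_exact_overlaps` and the toy sequences are illustrative and not part of the benchmark. It removes only exact sequence matches, which is the weakest possible filter; stricter protocols would typically also remove near-duplicates using a sequence-identity threshold.

```python
def remove_exact_overlaps(pretrain, test_sequences):
    """Drop pretraining entries whose sequence appears verbatim in the test set.

    Comparison is case-insensitive. Near-duplicates are NOT detected here.
    """
    held_out = {seq.upper() for seq in test_sequences}
    return {pid: seq for pid, seq in pretrain.items() if seq.upper() not in held_out}

# Hypothetical toy data, for illustration only.
pretrain = {
    "p1": "MKTAYIAKQR",
    "p2": "GSHMLEDPVA",
    "p3": "mktayiakqr",  # same sequence as p1, different case
}
test_sequences = ["MKTAYIAKQR"]

filtered = remove_exact_overlaps(pretrain, test_sequences)
print(sorted(filtered))
```

Whether and how the benchmark applies such filtering is not stated in the excerpt above, so this should be read as an illustration of the reviewer's suggestion rather than a description of the authors' pipeline.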
Taxonomy
Topics: Protein Structure and Dynamics · Genetics, Bioinformatics, and Biomedical Research · Machine Learning in Bioinformatics
Methods: AlphaFold
