ProteinBench: A Holistic Evaluation of Protein Foundation Models
Fei Ye, Zaixiang Zheng, Dongyu Xue, Yuning Shen, Lihao Wang, Yiming Ma, Yan Wang, Xinyou Wang, Xiangxin Zhou, Quanquan Gu

TL;DR
ProteinBench provides a comprehensive, multi-dimensional evaluation framework for protein foundation models, addressing current gaps in understanding their capabilities and limitations through standardized metrics and analyses.
Contribution
It introduces a holistic, multi-metric evaluation framework for protein models, including a taxonomy, performance metrics, and analysis tools, with publicly available resources.
Findings
Reveals strengths and weaknesses of current protein models
Highlights areas for improvement in robustness and diversity
Provides a standardized benchmark for future research
Abstract
Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance in protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations associated with these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness;…
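As a rough illustration of what a multi-metric report over the four dimensions named in the abstract could look like, here is a minimal sketch. It is not ProteinBench's implementation: the names `DIMENSIONS`, `validate_scores`, and `mean_pairwise_diversity` are hypothetical, and the diversity measure shown (mean pairwise Hamming fraction over equal-length sequences) is an assumed stand-in, not a metric taken from the paper.

```python
from itertools import combinations

# The four evaluation dimensions named in the abstract.
DIMENSIONS = ("quality", "novelty", "diversity", "robustness")


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_diversity(seqs: list[str]) -> float:
    """Assumed diversity proxy: average Hamming fraction over all pairs."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


def validate_scores(scores: dict[str, float]) -> dict[str, float]:
    """Return scores ordered by dimension, failing if any dimension is missing."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    return {d: scores[d] for d in DIMENSIONS}
```

Requiring every dimension to be present keeps a model from being ranked on a single axis alone, which appears to be the point of reporting the four dimensions side by side.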
Peer Reviews
Decision·ICLR 2025 Poster
- The framework’s taxonomy of tasks within the domain of protein foundation models is insightful. It makes it easier to evaluate where each model excels or falls short.
- The multi-dimensional metrics aim to capture various aspects of model performance, which is appropriate given the complexity of protein modeling.
- The authors conduct a large number of experiments, demonstrating the breadth of the evaluation and ensuring the results' validity across various models and tasks.
- Leaderboa
- Given the extensive experimental study the authors have carried out, some reorganization could strengthen how the paper delivers its contributions. Including clear and complete definitions and explanations of the metrics, along with their relevance, would be helpful. The relevance and insights of the results could replace the explanations of the results. For example, Section 2.2.6 Antibody Design, instead of listing the outperforming models for evaluation, which is provided in Table 6
See main review.
Novel Evaluation Framework: The paper proposes a well-structured framework that standardizes evaluation for protein foundation models, addressing a significant need in the field. By evaluating on multiple fronts (quality, novelty, diversity, and robustness), ProteinBench gives a well-rounded assessment of model performance.
Task Diversity and Practical Relevance: ProteinBench is inclusive of various protein modeling tasks, including antibody design and multi-state prediction, which are highly rele
Lack of Standardized Training Data: Differences in training datasets among models hinder direct comparison. Standardizing datasets would improve the ability to compare model architectures and may be essential for achieving fairer assessments within ProteinBench.
Taxonomy
Topics: Biomedical Text Mining and Ontologies
