PDFBench: A Benchmark for De novo Protein Design from Function

Jiahao Kuang; Nuowei Liu; Jie Wang; Changzhi Sun; Tao Ji; Yuanbin Wu

arXiv:2505.20346·cs.LG·September 30, 2025

TL;DR

PDFBench introduces a comprehensive evaluation framework for function-guided de novo protein design, enabling fair comparison of models across multiple metrics and settings, thus advancing research in drug discovery and enzyme engineering.

Contribution

The paper presents the first unified benchmark for function-guided de novo protein design, systematically assessing eight models on 16 metrics so that results are comparable and the relationships between metrics are better understood.

Findings

01

Benchmarking eight models across 16 metrics reveals diverse strengths and weaknesses.

02

Correlation analysis provides insights into metric relationships and evaluation robustness.

03

SwissTest dataset ensures data integrity with a strict datetime cutoff.
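A datetime cutoff of the kind mentioned in the last finding can be sketched as follows. This is only an illustration under assumptions: the record layout, the accession strings, and the cutoff date are all hypothetical and are not taken from the paper or from SwissTest itself.

```python
from datetime import date

# Hypothetical records: (accession, keywords, first release date).
# None of these values come from the paper; they only illustrate the idea.
entries = [
    ("ACC_A", ["Hydrolase"], date(2021, 3, 1)),
    ("ACC_B", ["Kinase", "ATP-binding"], date(2024, 6, 15)),
    ("ACC_C", ["Transferase"], date(2023, 1, 1)),
]


def split_by_cutoff(records, cutoff):
    """Split records so that only entries released strictly after
    `cutoff` end up in the held-out test set."""
    before = [r for r in records if r[2] <= cutoff]
    after = [r for r in records if r[2] > cutoff]
    return before, after


# Assumed cutoff date, chosen purely for the example.
train_like, test_like = split_by_cutoff(entries, cutoff=date(2023, 1, 1))
```

With a strict comparison, an entry released exactly on the cutoff date stays on the training side, so nothing a model could have seen before the cutoff leaks into the test set.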

Abstract

Function-guided protein design is a crucial task with significant applications in drug discovery and enzyme engineering. However, the field lacks a unified and comprehensive evaluation framework. Current models are assessed using inconsistent and limited subsets of metrics, which prevents fair comparison and a clear understanding of the relationships between different evaluation criteria. To address this gap, we introduce PDFBench, the first comprehensive benchmark for function-guided de novo protein design. Our benchmark systematically evaluates eight state-of-the-art models on 16 metrics across two key settings: description-guided design, for which we repurpose the Mol-Instructions dataset, originally lacking quantitative benchmarking, and keyword-guided design, for which we introduce a new test set, SwissTest, created with a strict datetime cutoff to ensure data integrity. By…
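To make the idea of relating evaluation criteria to one another concrete, here is a minimal sketch of a rank correlation between two metrics. Spearman correlation is an assumption chosen for illustration; the excerpt above does not say which correlation measure the authors use, and the metric names and scores below are made up.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-model scores for two hypothetical metrics.
metric_a = [0.2, 0.5, 0.9, 0.4]
metric_b = [10.0, 20.0, 40.0, 15.0]
rho = spearman(metric_a, metric_b)
```

A value of `rho` near 1 or -1 would suggest that two metrics rank models almost identically (or oppositely) and are therefore partly redundant, while a value near 0 would suggest they capture different aspects of quality.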

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. The benchmark fills an important gap in functional protein design and provides a unified, reproducible evaluation framework.
2. The paper is well-organized, and the rethinking section is particularly informative and insightful.

Weaknesses

1. The dataset and code links are currently inaccessible (“The requested file is not found”), which limits reproducibility.
2. Many evaluation metrics overlap with those used in ProDVa, raising concerns about the degree of novelty beyond benchmark integration.

Reviewer 02·Rating 6·Confidence 3

Strengths

The authors consider "Plausibility", "Foldability", "Language Alignment", "Novelty", and "Diversity", most of which are well-established metrics that are widely accepted across different works. Text-guided protein design is a relatively new task without an existing benchmark. The benchmark design is good: dividing tasks into "text-guided" and "keyword-guided" is practical and useful.

Weaknesses

In general, designing proteins from function descriptions is not well accepted because the design process offers little controllability, and a traditional pipeline built on RFdiffusion and ProteinMPNN can accomplish a similar task (by describing the function through structure, or by using additional models). The paper would benefit from more discussion comparing the traditional workflow with the function2protein workflow.

Reviewer 03·Rating 2·Confidence 3

Strengths

The paper's primary strength is the attempt to address a clear gap in the field: the lack of a unified and comprehensive evaluation framework for function-guided (text-guided) protein design models. The systematic effort to evaluate multiple state-of-the-art models across a variety of metrics is a valuable starting point.

Weaknesses

The proposed benchmark suffers from several critical weaknesses that undermine its conclusions and practical relevance:
1. The core motivation of text-guided design is questionable. Coarse-grained descriptions like Gene Ontology (GO) terms, EC numbers, or keywords are fundamentally insufficient for specifying novel, complex functions (e.g., designing a high-affinity binder for a specific, newly discovered target). This limits the real-world utility of the entire methodology.
2. The benchmark uti

Code & Models

Datasets

Knlife/SwiwwProtIPG
dataset· 121 dl

Taxonomy

Topics·Microbial Metabolic Engineering and Bioproduction