Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction
Jie Li, Xingyi Guan, Oufan Zhang, Kunyang Sun, Yingze Wang, Dorian Bagni, and Teresa Head-Gordon

TL;DR
This study introduces a leak-proof split of the PDBBind dataset for more reliable evaluation of protein-ligand scoring functions, shows that models retrained on it perform better on new complexes, and identifies IGN as the top-performing scoring function.
Contribution
The paper provides a carefully curated, leak-proof dataset split for more accurate evaluation of scoring functions and retrains popular models to improve their generalizability on unseen data.
Findings
Models retrained on the leak-proof split perform better on new complexes.
IGN outperforms the other scoring functions on new complexes.
The new dataset reduces data leakage issues in evaluation.
Abstract
The majority of machine learning scoring functions (SFs) used in drug discovery for predicting protein-ligand binding poses and affinities have been trained on the PDBBind dataset. However, it is unclear whether these new scoring functions are actually an improvement over traditional models, since the training and test sets are often cross-contaminated with highly similar proteins and ligands, and hence the models may not perform comparably well in binding prediction for unrelated protein-ligand complexes. In this work we have carefully prepared a new split of the PDBBind dataset to control for data leakage, defined as proteins and ligands with high sequence and structural similarity. The resulting leak-proof (LP)-PDBBind data is used to retrain four popular SFs: AutoDock Vina, Random Forest (RF)-Score, InteractionGraphNet (IGN), and DeepDTA, to better test their capabilities when applied to…
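The idea of controlling leakage by similarity can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Complex` record, the k-mer Jaccard measure standing in for sequence similarity, the bit-set Jaccard (Tanimoto) measure standing in for ligand similarity, and the 0.5 cutoffs are all assumptions chosen for illustration. A real workflow would more likely use sequence alignment and chemical fingerprints, with thresholds taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complex:
    pdb_id: str              # hypothetical identifier
    sequence: str            # protein sequence, one-letter codes
    ligand_bits: frozenset   # toy stand-in for a ligand fingerprint


def kmers(seq, k=3):
    """Set of overlapping k-mers; a crude proxy for sequence similarity."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_leaky(candidate, test_set, prot_cut=0.5, lig_cut=0.5):
    """True if the candidate resembles any test complex in protein OR ligand.

    The cutoffs are illustrative assumptions, not values from the paper.
    """
    for t in test_set:
        prot_sim = jaccard(kmers(candidate.sequence), kmers(t.sequence))
        lig_sim = jaccard(candidate.ligand_bits, t.ligand_bits)
        if prot_sim > prot_cut or lig_sim > lig_cut:
            return True
    return False


def leak_proof_train(train, test, prot_cut=0.5, lig_cut=0.5):
    """Keep only training complexes that are dissimilar to every test complex."""
    return [c for c in train if not is_leaky(c, test, prot_cut, lig_cut)]


test = [Complex("T1", "MKTAYIAKQR", frozenset({1, 2, 3, 4}))]
train = [
    Complex("A1", "MKTAYIAKQR", frozenset({10, 11})),      # same protein as T1
    Complex("A2", "GGSPLVWNDE", frozenset({1, 2, 3, 5})),  # ligand close to T1's
    Complex("A3", "GGSPLVWNDE", frozenset({20, 21, 22})),  # dissimilar on both
]
kept = leak_proof_train(train, test)
print([c.pdb_id for c in kept])
```

In this toy data only A3 survives the filter: A1 shares its protein with the test complex, and A2 carries a near-identical ligand. Filtering on either criterion is the stricter choice; whether the paper combines the two criteria this way is not stated in the text above.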
Taxonomy
Topics: Computational Drug Discovery Methods · Protein Structure and Dynamics · Machine Learning in Materials Science
