A Workflow to Create a High-Quality Protein-Ligand Binding Dataset for Training, Validation, and Prediction Tasks
Yingze Wang, Kunyang Sun, Jie Li, Xingyi Guan, Oufan Zhang, Dorian Bagni, and Teresa Head-Gordon

TL;DR
This paper introduces HiQBind-WF, a semi-automated workflow that curates high-quality protein-ligand datasets by fixing structural artifacts, thereby improving the reliability of scoring functions used in drug discovery.
Contribution
The authors developed an open-source, reproducible workflow to create a high-quality protein-ligand dataset, addressing issues in existing datasets like PDBbind.
Findings
The workflow effectively fixes common structural artifacts in protein-ligand datasets.
The resulting HiQBind dataset improves data quality for training and testing scoring functions.
Releasing the workflow as open source promotes transparency and reproducibility in dataset curation.
Abstract
Development of scoring functions (SFs) used to predict protein-ligand binding energies requires high-quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one of the widely used datasets, PDBbind, suffers from several common structural artifacts of both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. Therefore, we have developed a series of algorithms organized in a semi-automated workflow, HiQBind-WF, that curates non-covalent protein-ligand datasets to fix these problems. We also used this workflow to create an independent dataset, HiQBind, by matching binding free energies from various sources including BioLiP, Binding MOAD and BindingDB with co-crystallized ligand-protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to…
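The matching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the HiQBind-WF implementation: the function names, the `(pdb_id, ligand_id)` key format, and the placeholder records are all assumptions made for illustration. It uses only the standard thermodynamic relation ΔG = RT ln(Kd) (with a 1 M standard state) to turn a dissociation constant into a binding free energy, then keeps the complexes that have both a structure and a measurement.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def binding_free_energy(kd_molar: float, temperature: float = 298.15) -> float:
    """Return dG = RT ln(Kd) in kcal/mol for a Kd given in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature * math.log(kd_molar)


def match_affinities(structure_keys, kd_by_key):
    """Hypothetical matcher: keep keys present in both inputs.

    structure_keys: iterable of (pdb_id, ligand_id) tuples with a structure.
    kd_by_key: dict mapping (pdb_id, ligand_id) to Kd in mol/L.
    Returns a dict mapping each matched key to its dG in kcal/mol.
    """
    matched = {}
    for key in structure_keys:
        if key in kd_by_key:
            matched[key] = binding_free_energy(kd_by_key[key])
    return matched


# Placeholder records, not real PDB entries or measurements.
structures = [("XXX1", "LIG"), ("XXX2", "LIG")]
affinities = {("XXX1", "LIG"): 1e-9, ("XXX3", "LIG"): 1e-6}

matched = match_affinities(structures, affinities)
```

Here only `("XXX1", "LIG")` appears in both inputs, so it is the only entry in `matched`. A real curation pipeline would also need to reconcile ligand identifiers, units, and measurement types (Kd, Ki, IC50) across sources, which this sketch does not attempt.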
Taxonomy
Topics: Computational Drug Discovery Methods · Bioinformatics and Genomic Networks · Genetics, Bioinformatics, and Biomedical Research
