Scaffold Splits Overestimate Virtual Screening Performance

Qianrong Guo, Saiveth Hernandez-Hernandez, and Pedro J Ballester

arXiv:2406.00873 · q-bio.QM · July 2, 2024 · 6 cites

TL;DR

This study demonstrates that scaffold splits, commonly used in virtual screening model evaluation, overestimate performance by creating overly similar training and test sets, highlighting the need for more realistic data splitting methods.

Contribution

The paper reveals that scaffold splits overestimate virtual screening performance and advocates for more realistic splitting methods like UMAP clustering.

Findings

01. Model performance drops significantly with UMAP splits.

02. Scaffold splits create unrealistic similarities between training and test sets.

03. More realistic data splits are essential for accurate model evaluation.
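
The second finding turns on how similar test molecules are to training molecules. One way to quantify that, sketched here under assumptions that do not come from the paper, is the Tanimoto similarity between each test molecule and its nearest training molecule. Fingerprints are represented as plain Python sets of "on" bit indices; the function names and the toy fingerprints are hypothetical.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def nearest_train_similarity(train: List[Set[int]], test: List[Set[int]]) -> List[float]:
    """For each test fingerprint, the highest similarity to any training fingerprint.

    Assumes `train` is non-empty.
    """
    return [max(tanimoto(t, tr) for tr in train) for t in test]


# Toy fingerprints (made up for illustration, not real molecules).
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{1, 2, 3, 5}, {20, 21}]
sims = nearest_train_similarity(train_fps, test_fps)
```

If most values in `sims` are high for a given split, the test set may be easier than a real screening library would be, which is the concern the findings raise.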

Abstract

Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits place similar molecules in both the training and test sets, which conflicts with the reality of VS libraries, which mostly contain structurally distinct compounds. The scaffold split, which groups molecules by shared core structure, is widely considered to reflect this real-world scenario. However, here we show that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar, so a scaffold split still leaves unrealistically high similarities between training and test molecules. Our study examined three representative AI models on 60 NCI-60 datasets,…
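
A scaffold split, as the abstract describes it, keeps all molecules that share a core structure on the same side of the split. The following is a minimal sketch of that grouping logic, assuming a scaffold key for each molecule has already been computed elsewhere (for example with a cheminformatics toolkit). The function name `group_split`, the largest-groups-first assignment, and the 80/20 fraction are illustrative assumptions, not the paper's protocol.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_split(keys: Sequence[str], train_frac: float = 0.8) -> Tuple[List[int], List[int]]:
    """Split molecule indices so that no group key appears in both sets."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, key in enumerate(keys):
        groups[key].append(idx)

    # Largest groups first; ties broken by key so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    target = train_frac * len(keys)
    train: List[int] = []
    test: List[int] = []
    for _, members in ordered:
        # A whole group goes to train if it fits under the target, else to test.
        if len(train) + len(members) <= target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical scaffold keys for ten molecules.
scaffold_keys = ["A", "A", "A", "B", "B", "C", "D", "A", "B", "E"]
train_idx, test_idx = group_split(scaffold_keys, train_frac=0.8)
```

Because the function only looks at group labels, the same sketch would apply to other groupings, such as cluster IDs, in place of scaffold keys.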

Taxonomy

Topics: Molecular Biology Techniques and Applications