Data Leakage and Redundancy in the LIT-PCBA Benchmark
Amber Huang, Ian Scott Knight, Slava Naprienko

TL;DR
This paper critically examines the LIT-PCBA benchmark for virtual screening, revealing extensive data leakage and redundancy that inflate performance metrics and undermine its validity for assessing true generalization in models.
Contribution
The study uncovers severe data leakage in LIT-PCBA, demonstrating that the benchmark in its current form is unreliable for evaluating genuine model generalization.
Findings
Widespread 2D-identical ligands within and across splits.
Analog overlap and low-diversity query sets compromise evaluation.
A trivial baseline can match or outperform state-of-the-art models.
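The first two findings can be illustrated with a small overlap audit. The sketch below is not the authors' code: it assumes each ligand has already been reduced to a set of on-bit fingerprint indices (for example from ECFP4), treats identical bit sets as a stand-in for 2D identity, and uses a hypothetical `threshold` parameter, since the cutoff used in the paper is not given here. The names `tanimoto` and `audit_overlap` are illustrative.

```python
from itertools import product


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def audit_overlap(train: dict, valid: dict, threshold: float):
    """Compare every training ligand with every validation ligand.

    train / valid map a ligand id to a frozenset of fingerprint bits.
    Returns (identical pairs, analog pairs at or above the threshold).
    Identical bit sets are only a proxy for 2D-identical structures.
    """
    identical, analogs = [], []
    for (t_id, t_fp), (v_id, v_fp) in product(train.items(), valid.items()):
        if t_fp == v_fp:
            identical.append((t_id, v_id))
            continue
        sim = tanimoto(t_fp, v_fp)
        if sim >= threshold:
            analogs.append((t_id, v_id, sim))
    return identical, analogs


# Toy data, invented for illustration only.
train = {"t1": frozenset({1, 2, 3, 4}), "t2": frozenset({10, 11})}
valid = {
    "v1": frozenset({1, 2, 3, 4}),   # same bits as t1
    "v2": frozenset({1, 2, 3, 5}),   # shares 3 of 5 bits with t1
    "v3": frozenset({20}),           # unrelated
}
identical, analogs = audit_overlap(train, valid, threshold=0.5)
print(identical)
print(analogs)
```

The pairwise loop is quadratic in the number of ligands, which is fine for a toy example but would need bulk similarity routines from a cheminformatics toolkit on a dataset of LIT-PCBA's size.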
Abstract
LIT-PCBA is widely used to benchmark virtual screening models, but our audit reveals that it is fundamentally compromised. We find extensive data leakage and molecular redundancy across its splits, including 2D-identical ligands within and across partitions, pervasive analog overlap, and low-diversity query sets. In ALDH1 alone, for instance, 323 active training-validation analog pairs occur at ECFP4 Tanimoto similarity; across all targets, 2,491 2D-identical inactives appear in both training and validation, with very few corresponding actives. These overlaps allow models to succeed through scaffold memorization rather than generalization, inflating enrichment factors and AUROC scores. These flaws are not incidental: they are so severe that a trivial memorization-based baseline with no learnable parameters can exploit them to match or exceed the reported performance of…
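To make the idea of a parameter-free memorization baseline concrete, here is a minimal sketch. It is an assumption-laden illustration, not the baseline described in the paper: each candidate is scored by its highest Tanimoto similarity to any training active, so a candidate that nearly duplicates a known active ranks first without any learning. Fingerprints are again assumed to be sets of on-bit indices, and `memorization_score` and `rank_by_memorization` are hypothetical names.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def memorization_score(query: frozenset, train_actives: list) -> float:
    """Highest similarity of the query to any training active (0.0 if none)."""
    return max((tanimoto(query, active) for active in train_actives), default=0.0)


def rank_by_memorization(candidates: dict, train_actives: list) -> list:
    """Return candidate ids sorted by descending score, ties broken by id."""
    scored = [
        (memorization_score(bits, train_actives), cid)
        for cid, bits in candidates.items()
    ]
    return [cid for _, cid in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy data, invented for illustration only.
train_actives = [frozenset({1, 2, 3, 4})]
candidates = {
    "c": frozenset({7, 8}),          # unrelated to the training active
    "b": frozenset({1, 2, 9}),       # partial overlap
    "a": frozenset({1, 2, 3, 4}),    # duplicate of the training active
}
ranking = rank_by_memorization(candidates, train_actives)
print(ranking)
```

When evaluation ligands duplicate or closely resemble training actives, a ranking like this can place true actives near the top, which is the mechanism by which leakage inflates enrichment factors and AUROC.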
