Large-scale Pretraining Improves Sample Efficiency of Active Learning based Molecule Virtual Screening
Zhonglin Cao, Simone Sciabola, Ye Wang

TL;DR
Pretrained transformer and graph neural network models significantly enhance the efficiency and accuracy of active learning in large-scale molecule virtual screening, reducing computational costs in drug discovery.
Contribution
This study demonstrates that pretrained models improve sample efficiency and accuracy in active learning for molecule screening, outperforming previous methods on ultra-large libraries.
Findings
Pretrained models identify nearly 59% of top compounds after screening only 0.6% of the library.
Pretrained models outperform the baseline by 8% in identifying top compounds.
Performance gains are consistent across structure-based and ligand-based drug discovery.
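The recovery figures above can be read as a top-k recall: the share of the library's best compounds that end up in the screened subset. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code. The function name `top_k_recall`, the toy scores, and the convention that a lower score is better (as with docking scores) are all assumptions made for the example.

```python
def top_k_recall(screened_ids, true_scores, k):
    """Fraction of the true top-k compounds that appear in screened_ids.

    Assumes lower scores are better (hypothetical docking-style convention).
    """
    # Rank every compound in the library by its true score, best first.
    ranked = sorted(range(len(true_scores)), key=lambda i: true_scores[i])
    top_k = set(ranked[:k])
    return len(top_k & set(screened_ids)) / k


# Toy library of six compounds with made-up scores.
true_scores = [-9.1, -3.2, -8.7, -5.0, -7.9, -2.1]
screened = [0, 3, 4]  # indices of the compounds that were actually scored
recall = top_k_recall(screened, true_scores, k=3)
```

Here the true top 3 are compounds 0, 2, and 4, and the screened subset contains two of them, so `recall` is 2/3.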
Abstract
Virtual screening of large compound libraries to identify potential hit candidates is one of the earliest steps in drug discovery. As the size of commercially available compound collections grows exponentially to the scale of billions, brute-force virtual screening using traditional tools such as docking becomes infeasible in terms of time and computational resources. Active learning and Bayesian optimization have recently been shown to be effective methods for narrowing down the search space. An essential component of those methods is a surrogate machine learning model that is trained on a small subset of the library to predict the desired properties of compounds. An accurate model can achieve high sample efficiency by finding the most promising compounds with only a fraction of the whole library being virtually screened. In this study, we examined the performance of pretrained…
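The surrogate-driven loop described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's method: `oracle` is a hypothetical stand-in for an expensive docking run, compounds are reduced to a single number, the surrogate is a simple nearest-neighbor average rather than a pretrained transformer or graph neural network, and acquisition is purely greedy.

```python
import random


def oracle(x):
    # Hypothetical stand-in for an expensive docking calculation; lower is better.
    return (x - 0.3) ** 2


def knn_predict(x, labeled, k=3):
    # Toy surrogate: mean score of the k nearest already-scored compounds.
    nearest = sorted(labeled, key=lambda p: abs(p[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_learning(library, n_init, batch_size, n_rounds, seed=0):
    rng = random.Random(seed)
    scored = {}  # compound index -> oracle score
    # Start from a small random subset of the library.
    for i in rng.sample(range(len(library)), n_init):
        scored[i] = oracle(library[i])
    for _ in range(n_rounds):
        # "Retrain" the surrogate on everything scored so far.
        labeled = [(library[i], s) for i, s in scored.items()]
        pool = [i for i in range(len(library)) if i not in scored]
        # Greedy acquisition: score the compounds the surrogate ranks best.
        pool.sort(key=lambda i: knn_predict(library[i], labeled))
        for i in pool[:batch_size]:
            scored[i] = oracle(library[i])
    return scored


library = [i / 999 for i in range(1000)]  # toy one-dimensional "compounds"
scored = active_learning(library, n_init=20, batch_size=20, n_rounds=4)
```

With these settings only 100 of the 1,000 toy compounds are ever passed to `oracle`. A more accurate surrogate places more of the truly best compounds in each batch, which is the sample-efficiency effect the abstract describes.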
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science
Methods: Lib · Graph Neural Network
