APEX: Approximate-but-exhaustive search for ultra-large combinatorial synthesis libraries

Aryan Pedawi; Jordi Silvestre-Ryan; Bradley Worley; Darren J Hsu; Kushal S Shah; Elias Stehle; Jingrong Zhang; Izhar Wallach

arXiv:2510.24380·cs.LG·October 29, 2025

3 Reviews

TL;DR

APEX is a neural network-based method that enables near-exhaustive virtual screening of ultra-large chemical libraries by predicting compound scores rapidly, improving the identification of top candidates in drug discovery.

Contribution

This work introduces APEX, a novel approximate-but-exhaustive search protocol that leverages neural networks for fast, near-complete enumeration of large chemical libraries, outperforming existing methods.

Findings

01

APEX achieves full enumeration of 10 million compounds in under a minute.

02

APEX retrieves top-scoring compounds more accurately than alternative algorithms.

03

APEX demonstrates strong performance in both accuracy and runtime across benchmark datasets.

Abstract

Make-on-demand combinatorial synthesis libraries (CSLs) like Enamine REAL have significantly enabled drug discovery efforts. However, their large size presents a challenge for virtual screening, where the goal is to identify the top compounds in a library according to a computational objective (e.g., optimizing docking score) subject to computational constraints under a limited computational budget. For current library sizes -- numbering in the tens of billions of compounds -- and scoring functions of interest, a routine virtual screening campaign may be limited to scoring fewer than 0.1% of the available compounds, leaving potentially many high scoring compounds undiscovered. Furthermore, as constraints (and sometimes objectives) change during the course of a virtual screening campaign, existing virtual screening algorithms typically offer little room for amortization. We propose the…
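The screening goal stated in the abstract (find the top compounds under an objective, subject to constraints) can be illustrated with a minimal sketch. This is not the APEX method. It assumes that approximate scores are already available for every compound, that lower scores are better (as is usual for docking), and that a single molecular-weight cap is the only constraint. The function `top_k_under_constraints` and all of the data are hypothetical stand-ins.

```python
import heapq
import random


def top_k_under_constraints(scores, weights, k, max_weight):
    """Return indices of the k lowest-scoring compounds with weight <= max_weight.

    Hypothetical helper: `scores` stands in for surrogate-predicted docking
    scores (lower assumed better) and `weights` for a physicochemical property.
    """
    feasible = (i for i, w in enumerate(weights) if w <= max_weight)
    # nsmallest returns the selected indices ordered from best to worst score
    return heapq.nsmallest(k, feasible, key=lambda i: scores[i])


random.seed(0)
n = 100_000  # toy library size, far smaller than a real CSL
approx_scores = [random.gauss(-7.0, 1.5) for _ in range(n)]    # synthetic scores
mol_weights = [random.uniform(150.0, 650.0) for _ in range(n)]  # synthetic property

hits = top_k_under_constraints(approx_scores, mol_weights, k=10, max_weight=500.0)

# For contrast: a 0.1% scoring budget on this toy library covers n // 1000 compounds.
budget = n // 1000
print(len(hits), budget)
```

Because every compound has a (cheap, approximate) score in this sketch, changing `k` or `max_weight` only requires re-running the selection, not re-scoring the library.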

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The evaluation of the method is reasonable: it includes five realistic molecular targets and measures runtime and recall, showing consistent acceleration over brute-force screening.

Weaknesses

1. Exceptionally limited related work. The paper omits extensive prior work in active learning and surrogate-based virtual screening, such as:
   - MEMES: Machine Learning Framework for Enhanced Molecular Screening [1]
   - Accelerating High-Throughput Virtual Screening through Molecular Pool-Based Active Learning [2]
   - Generative AI for Navigating Synthesizable Chemical Space [3]

   These works already report more sophisticated active learning and synthesizability-aware loops, which APEX ne

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- Problem importance: The paper tackles a very real bottleneck in structure-based and library-based discovery: current make-on-demand CSLs are now so large (10⁹–10¹⁰) that even “smart” virtual screening methods end up seeing <1% of the space, leaving high-scoring chemistry unexplored. A method that makes declarative, low-latency queries over the entire library is valuable to both method developers and practitioners. The paper also recognizes that constraints (Lipinski, Veber, fragment-like rules

Weaknesses

- Dependence on synthon + reaction paths for every library compound: APEX only works because every product in the library is addressable as “reaction + R-group + synthon” and the factorizer has been trained on that exact CSL structure. This is fine for Enamine-style, synthon-organized libraries, but it also means the method is not directly applicable to an arbitrary compound library (e.g., a corporate merged screening collection, ChEMBL-like flat sets, or ad-hoc AI-generated enumerations) unless

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. I appreciate that the authors explicitly address real-world virtual screening constraints that are often overlooked in academic papers. The paper recognizes that modern CSLs contain tens of billions of compounds, yet computational budgets typically allow evaluation of less than 0.1% of the library, which is a genuine bottleneck in industrial drug discovery settings. The emphasis on handling multiple constraints is also valuable, as constraint satisfaction is critical in real-world drug discov

Weaknesses

1. The text size in Figure 1 and Figure 2 is extremely small and nearly illegible at normal viewing resolution. I had to zoom to 500% magnification to read the text labels, annotations, and diagram components. The authors should significantly increase font sizes and improve the overall figure format.
2. The paper makes claims about screening 10 billion compounds using a surrogate model trained on only 1 million, yet provides insufficient evidence to support the reliability of this generalizatio
