# Query Matters: How Selection Strategies Influence Active Learning in Drug Discovery

**Authors:** Huw J. Williams, Stephen D. Pickett, Andrew Baxter, David S. Palmer

PMC · DOI: 10.1021/acs.jcim.5c02504 · 2026-02-26

## TL;DR

This paper introduces SimDMTA, an in silico simulation of the Design–Make–Test–Analyze (DMTA) cycle, and uses it to show how the choice of query strategy in active learning affects hit discovery and model generalization.

## Contribution

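The contribution is SimDMTA, a machine learning-based framework that simulates the Design–Make–Test–Analyze (DMTA) cycle, using docking scores as a proxy for biological assays, in order to compare query strategies in active learning for drug discovery.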

## Key findings

- Uncertainty-based sampling outperforms greedy and hybrid approaches in hit discovery and model generalization.
- By the final iteration, 37 of the top 50 ranked compounds were within the top 1% of all evaluated compounds.
- Strategies that include some random selection correct systematic biases more rapidly but are less effective at predicting top-performing molecules.
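
The "37 of the top 50" figure is an overlap count between the compounds the model ranks best and the best fraction of the evaluated library. A minimal sketch of such a count follows. The function name `hits_in_top_fraction` and its parameters are hypothetical, and treating lower scores as better is an assumption based on the usual docking-score convention, not something taken from the paper.

```python
import math


def hits_in_top_fraction(predicted, reference, k=50, fraction=0.01):
    """Count how many of the k best-predicted compounds fall in the best
    `fraction` of the library according to the reference scores.

    Assumption: lower scores are better (typical for docking scores).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    n = len(reference)
    n_top = max(1, math.ceil(n * fraction))

    # Indices of the best compounds according to the reference scores.
    by_reference = sorted(range(n), key=lambda i: reference[i])
    top_reference = set(by_reference[:n_top])

    # Indices of the k compounds the model ranks best.
    by_prediction = sorted(range(n), key=lambda i: predicted[i])
    top_predicted = by_prediction[:k]

    return sum(1 for i in top_predicted if i in top_reference)
```

Called with the model's predictions and the docking scores for the same compounds, in the same order, the function returns an integer between 0 and `k`.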

## Abstract

We present SimDMTA, an in silico framework designed to simulate the Design–Make–Test–Analyze (DMTA) cycle used in preclinical drug discovery. Using docking scores as a proxy for biological assays, the simulations allow factors controlling the efficiency of the DMTA cycle to be explored in a manner that would not be feasible using traditional experiments due to time and cost constraints. In this workflow, a machine learning model predicts docking scores, selects compounds using various query strategies, docks selected molecules, and retrains iteratively. Starting from a broad chemical space, the model actively samples molecules derived from a 3,5-dimethyl-4-phenylisoxazole scaffold, an active warhead for the Bromodomain 4 (BRD4) BD1 binding site, to refine its predictions. Our results show that uncertainty-based sampling significantly outperforms greedy and hybrid approaches in both hit discovery and the ability of the model that predicts docking scores to generalize beyond its training set. Notably, by the final iteration, 37 of the top 50 ranked compounds were within the top 1% of the chemical space of all evaluated compounds. Strategies that include some random selection correct systematic biases more rapidly, but are less effective at predicting top-performing molecules. These findings underscore the value of incorporating molecular diversity and uncertainty into design strategies. While such strategies may deprioritize those molecules with the highest absolute predictions in early rounds, they markedly accelerate model refinement, ultimately leading to more effective hit identification in discovery driven by active learning.
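
The predict, select, dock, retrain loop described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the SimDMTA implementation: the surrogate here is a bootstrap ensemble of k-nearest-neighbour regressors chosen only for brevity, `dock` is a hypothetical callable standing in for the docking step, and `run_cycles`, the batch sizes, and the strategy names are all made up for the example. Lower scores are assumed to be better.

```python
import random
import statistics


def knn_predict(train, x, k=3):
    """Mean score of the k training points nearest to x (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )[:k]
    return statistics.fmean(score for _, score in nearest)


def ensemble_predict(train, x, rng, n_models=10):
    """Bootstrap ensemble: returns (mean prediction, spread across members)."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]  # resample with replacement
        preds.append(knn_predict(boot, x))
    return statistics.fmean(preds), statistics.pstdev(preds)


def run_cycles(library, dock, strategy, n_init=10, batch=5, n_iter=5, seed=0):
    """Simulate iterative cycles; returns {library index: docked score}."""
    if strategy not in ("greedy", "uncertainty", "random"):
        raise ValueError(f"unknown strategy: {strategy}")
    rng = random.Random(seed)
    pool = list(range(len(library)))
    rng.shuffle(pool)
    # Seed the training set with a random initial batch.
    labelled = {i: dock(library[i]) for i in pool[:n_init]}
    pool = pool[n_init:]

    for _ in range(n_iter):
        # "Analyze": rebuild the training set, so the ensemble is refit each cycle.
        train = [(library[i], score) for i, score in labelled.items()]
        # "Design": choose the next batch with the query strategy.
        if strategy == "random":
            chosen = rng.sample(pool, batch)
        else:
            preds = {i: ensemble_predict(train, library[i], rng) for i in pool}
            if strategy == "greedy":
                ranked = sorted(pool, key=lambda i: preds[i][0])   # best predicted first
            else:
                ranked = sorted(pool, key=lambda i: -preds[i][1])  # most uncertain first
            chosen = ranked[:batch]
        # "Make" and "Test": score the chosen compounds with the oracle.
        for i in chosen:
            labelled[i] = dock(library[i])
        chosen_set = set(chosen)
        pool = [i for i in pool if i not in chosen_set]
    return labelled
```

For a quick try, `library` can be a list of numeric tuples (toy descriptors) and `dock` any function mapping a tuple to a float. With the defaults, the returned dictionary holds `n_init + batch * n_iter` scored compounds. A hybrid strategy, as mentioned in the abstract, could be approximated by mixing the ranked and random branches within one batch.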

## Linked entities

- **Proteins:** BRD4 (bromodomain containing 4)
- **Chemicals:** 3,5-dimethyl-4-phenylisoxazole (PubChem CID 5325760)

## Full-text entities

- **Genes:** DEFB1 (defensin beta 1) [NCBI Gene 1672] {aka BD1, DEFB-1, DEFB101, HBD1}
- **Chemicals:** 3,5-dimethyl-4-phenylisoxazole (-)

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13014458/full.md

---
Source: https://tomesphere.com/paper/PMC13014458