ArtiDock: Accurate Machine Learning Approach to Protein–Ligand Docking Optimized for High-Throughput Virtual Screening
Taras Voitsitskyi, Ihor Koleiev, Roman Stratiichuk, Oleksandr Kot, Roman Kyrylenko, Illia Savchenko, Vladyslav Husak, Semen Yesylevskyy, Sergii Starosyla, Alan Nafiiev

TL;DR
ArtiDock is a machine learning-based docking method that improves accuracy and efficiency for drug discovery screening.
Contribution
ArtiDock introduces a novel ML-based docking approach optimized for high-throughput virtual screening with improved accuracy and performance.
Findings
ArtiDock is 29–38% more accurate than leading docking tools like AutoDock, Vina, and Glide.
It excels in challenging scenarios with unbound proteins and binding sites containing ions and structured water.
ArtiDock achieves competitive accuracy at higher throughput compared to AI docking and cofolding methods.
Abstract
Classical protein–ligand docking has been a cornerstone technique in computational drug discovery for decades but has reached an accuracy and performance plateau. Recently introduced Machine Learning (ML)-based docking methods offer a promising paradigm shift, but their practical adoption is hampered by accuracy-to-speed trade-offs, inadequate benchmarking standards, and questionable chemical validity of predicted poses. In this study, we introduce ArtiDockan ML-based docking technique optimized for high-throughput virtual screening applications. To evaluate ArtiDock, we developed a dedicated performance and accuracy benchmark for pocket-specific rigid protein–ligand docking, which mimics realistic industrial drug discovery scenarios and is based on the novel PLINDER data set. We demonstrate that ArtiDock is 29–38% more accurate in comparison to leading open-source and commercial…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
1
2
3
4
5
6
7|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|
| AutoDock-CPU | 3.5 | 0.32 | 0.56 | 0.56 | 0.90 | 0.30 | 18 |
| AutoDock-GPU | 3.2 | 0.35 | 0.60 | 0.60 | 0.91 | 0.33 | 17 |
| Vina | 2.9 | 0.38 | 0.62 | 0.59 |
| 0.36 | 22 |
| Glide | 2.8 | 0.39 | 0.64 | 0.62 | 0.91 | 0.36 | 66 |
| ArtiDock |
|
|
|
| 0.72 | 0.38 | 0 |
| ArtiDock + Vina | 2.2 | 0.45 | 0.70 | 0.74 | 0.84 |
| 19 |
| ArtiDock + UFF | 2.3 | 0.41 | 0.69 | 0.76 |
|
| 14 |
|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|
| AutoDock-CPU | 5.9 | 0.10 | 0.30 | 0.23 | 0.91 | 0.10 | 6 |
| AutoDock-GPU | 6.2 | 0.09 | 0.28 | 0.21 | 0.91 | 0.09 | 5 |
| Vina | 6.5 | 0.08 | 0.27 | 0.17 |
| 0.07 | 5 |
| Glide | 5.5 | 0.13 | 0.33 | 0.26 |
| 0.13 | 54 |
| ArtiDock |
|
|
|
| 0.65 |
| 0 |
| ArtiDock + Vina | 4.2 | 0.19 | 0.46 | 0.43 | 0.80 |
| 5 |
| ArtiDock + UFF | 3.9 | 0.16 | 0.47 | 0.45 | 0.89 | 0.15 | 6 |
- —H2020 Marie Sklodowska-Curie Actions10.13039/100010665
- —Ministerstvo ?kolstv?, Ml?de?e a Telov?chovy10.13039/501100001823
- —National Academy of Sciences of Ukraine10.13039/501100004742
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsComputational Drug Discovery Methods · Protein Structure and Dynamics · Protein Degradation and Inhibitors
Introduction
Classical or physics-based docking has been one of the foundational tools in computational drug discovery for decades. During this extended period, the docking technology has witnessed an impressive scaling up and a universal adoption in academia and industry, while remaining remarkably unchanged in terms of algorithms used. Indeed, all the major docking scoring functions have not evolved much since their introduction. All inherent limitations of classical docking, such as poor handling of the metal atoms, coordination bonds, polarization, charge transfer, and entropic contribution from water, persist to date with no clear trend for improvement.
While classical docking has traditionally relied on rigid receptor approximations for speed, significant efforts have been made to incorporate protein flexibility. Notable examples include MedusaDock, ?,? Glide IFD,? and AutoDock Vina ?,? flexible docking. While these methods offer improved accuracy in cross-docking scenarios, their higher computational cost often limits their application in the initial stages of high-throughput virtual screening, where rigid receptor approaches remain the standard.
There is currently an implicit consensus in the community that classical docking has reached the practical plateau of accuracy, and no considerable progress could be made without a paradigm change. ?,?
Such a change has emerged in recent years with the appearance of the ML techniques for ligand pose prediction, often colloquially called an “AI docking”. In contrast to the classical docking, which is based on minimization of some physics-based scoring function, these techniques leverage a completely data-centric approach by learning from experimentally determined protein–ligand complexes.
The first generation of the ML docking techniques, such as DeepDock,? TANKBind,? EquiBind,? and Uni-Mol,? demonstrated results that were subpar to conventional docking in terms of accuracy and chemical validity,? while being significantly faster. More heavyweight tools, such as the diffusion model DiffDock, ?,? or cofolding methods RoseTTAFold All-Atom,? NeuralPLexer3,? AlphaFold3,? Chai-1,? Boltz-1,? and Protenix,? have shown an impressive boost in accuracy that comes, however, at the expense of very complex architectures, large model sizes, and slow training and inference.
However, it quickly became evident that accurate and unbiased assessment of ML docking techniques remains challenging due to the absence of commonly accepted, reliable benchmarks suitable for these data-driven methods. The majority of current evaluations overlook the data leakage between training and test data sets, which can result in overly optimistic performance metrics. ?,? In particular, the performance of cofolding tools, often showing the greatest accuracy, appears to be strongly correlated with the similarity between their test and training molecular structures.?
As a result, existing ML docking tools are unable to replace the classical ones in the practical high-throughput virtual screening tasks for two main reasons: an unfavorable accuracy-to-speed ratio and the absence of proper evaluation for ML techniques. In other words, we still have not reached the much-anticipated docking paradigm shift.
The recently released PLINDER data set? addresses most of the existing benchmarking challenges. This data set contains a large and diverse predefined training subset from the Protein Data Bank (PDB).? The train-test splitting process is cluster-based, which minimizes both the data leakage between the sets and the test set redundancy. The split considers both protein sequence similarity and protein–ligand interaction similarity, while prioritizing the inclusion of biologically relevant, high-quality structures in the test.?
In this work, we evaluated the latest version of our proprietary ML-based protein–ligand docking tool, ArtiDock (Figure), against the leading open-source and commercial classical docking programs such as AutoDock4 (referred to as AutoDock-CPU hereafter), ?,? AutoDock-GPU,? Vina, ?,? and Glide. ?−? ? ? Our primary goal was to evaluate these techniques in the context of high-throughput rigid docking into a known predefined protein binding pocket, which reflects the most common use case in real-world drug discovery projects.
General scheme of ArtiDock model architecture and inference pipeline.
To ensure an unbiased evaluation of ArtiDock, we retrained the model on the PLINDER training split and assessed its final performance, alongside competitors, on the PLINDER test set. The evaluation covers docking accuracy, chemical and geometric validity of the predicted poses, and computational efficiency. The test cases also represented scenarios when the input protein structure is predicted or obtained from the ligand-unbound state, as well as cases with organic cofactors, ions, or water molecules in the binding pockets.
In most comparisons between traditional and ML-based docking methods, default suboptimal parameters are used for the classical tools, or parameter selection does not consider computational cost. ?,? This is an important omission, as two techniques with comparable accuracy may have very different throughput and thus different utility in real-world applications. To address this, we conducted a series of experiments to optimize traditional docking methods for the trade-off between accuracy and runtime, ensuring a fair and practically relevant setup for evaluation.
In order to provide an estimate on how ArtiDock performs in comparison to popular ML-based solutions without the need to retrain heavyweight architectures on the PLINDER split, we used PoseX benchmark.? This benchmark was built to evaluate both self-docking and cross-docking (unbound protein state) in a practical setting, covering a broad range of physics-based, ?,?,?−? ? AI docking, ?,?,?,?,?,?−? ? ? ? ? and AI cofolding ?,?−? ? ?,? tools (total of 23 methods).
Our results demonstrate that ArtiDock considerably outperforms all tested physics-based methods and competes with state-of-the-art ML models in terms of accuracy while offering throughput on par with the fastest tools (detailed runtime analysis provided in Appendix E). Thus, our ML docking approach pretends to be a method of choice in high-throughput virtual screening applications.
In this paper, we describe ArtiDock architecture, inference pipeline, training data preparation, and training procedure; outline the selection of optimal classical docking parameters; provide a comprehensive performance comparison and discuss the limitations and future directions.
Methods
Evaluation Data Sets
For the classical docking parametrization experiments, we used the PoseBusters Benchmark set? that consists of 308 crystal complexes from the PDB, each representing a unique protein–ligand pair (PLP). We removed 2 PLPs that failed to be processed by the evaluation utility.
The test split of the PLINDER data set (release 2024-06, iteration v2)? was selected for the final benchmarking of classical docking approaches and ArtiDock. After the initial preprocessing and quality filtering (Appendix A), the test data contained 1,035 PLPs with 676 unique ligand Chemical Component Dictionary (CCD) codes. We retained a single ligand per CCD that forms the maximum number of noncovalent interactions with the protein. As a result, the final PLINDER subset consisted of 676 PLPs from 660 systems representing 656 PDB entries. We removed one PLP that failed to be processed by the evaluation utility. All the nonprotein entities (heteroatoms) were kept in the test systems to assess the methods’ performance on the pockets containing cofactors, ions, and water molecules.
We also used a subset of PLINDER test systems proposed in the Machine Learning in Structural Biology (MLSB) 2024 challenge.? The PLINDER MLSB test comprises 346 single ligand-single protein systems. The main features of this data set are unbound input protein structures from another system (apo conformations) and protein structures predicted by ML methods.? This data set represents a realistic docking scenario in which the exact bound (holo) conformation of the protein is unknown. To ensure consistent pocket definitions across both unbound and bound structures, we superimposed the apo/predicted proteins onto their holo counterparts with the MolAR molecular modeling library,? aligning the ligand-binding sites, defined as all residues within 8 Å of the resolved ligand. After the preprocessing, the final PLINDER MLSB test set contained 344 PLPs.
As we retrained the ArtiDock on the PLINDER train split, the data leakage between the splits is kept at a minimum, which avoids potential overfitting to any test samples and ensures objective model capabilities evaluation. However, this prevents us from including other ML-based docking techniques in the comparison because they also have to be retrained on the same split, which is not technically feasible.
Nevertheless, we provided a comparison of ArtiDock to ML-based solutions using the PoseX benchmark.? The data set comprises 718 self-docking entries (holo conformations) and 1,312 cross-docking entries (apo conformations) spanning 109 protein targets from 371 PDB entries and 362 unique small molecules. Critically, PoseX is built with a strict time-based inclusion in which only PDB complexes released between 2022 and 01-01 and 2025-01-01 are retained. This temporal split is effective for comparing off-the-shelf tools without retraining. Note, this time-based split is not optimal for comparison of ML methods, where fixed structure-based train/validation/test partitions would be preferable when sufficient computational resources for retraining are available. For the PoseX evaluation, we retrained the ArtiDock model on PLINDER PLPs released before 2022-01-01.
Evaluation Metrics
To evaluate the predicted ligand poses, we used three types of scores:
- Symmetry-corrected root mean square deviation (RMSD): this score represents absolute deviation between experimentally determined and predicted ligand poses.?
- Local distance difference test for protein–ligand interactions (lDDT-PLI): this score characterizes the extent of reproduction of the protein–ligand interactions present in the experimental complexes.?
- PoseBusters quality checks, designed to measure the physical and chemical correctness of predicted ligand poses.?
The scores were calculated with the PLINDER Python API that, under the hood, used OpenStructure v.2.8.0 ?,? and PoseBusters v.0.3.1? packages.
For the benchmarking, we selected the top-ranked predicted pose based on the internal scoring function of each technique and reported the median RMSD, fraction of samples with RMSD < 2 Å, median lDDT-PLI, fraction of samples with lDDT-PLI > 0.5, and fraction of samples passing all PoseBusters quality checks (PB-Valid).
The PoseX benchmark included the fraction of samples with RMSD < 2 Å and the fraction of samples with both RMSD < 2 Å and PB-Valid. For the cross-docking scenario, the averaged fraction at the target level was reported because of the uneven distribution of complex numbers per target.?
Classical Docking Setup
AutoDock-Based Methods
The AutoDock-CPU, AutoDock-GPU, and Vina require input protein and ligand in a PDBQT format. The same format is used to represent output ligand poses. Therefore, we applied a unified pre/postprocessing pipeline for these docking tools.
The ligand preparation and postdocking processing were performed using the Python API of Meeko v.0.6.1.? For the AutoDock-CPU and AutoDock-GPU, ligand macrocycles were set to be rigid to avoid chemically invalid structures in the output.
The protein PDBQT files were obtained with MGLTools v.1.5.7? using options to protonate molecules and merge charges from nonpolar hydrogens and lone pairs. We noticed that this tool fails to assign charges to water molecules and some metals. To rectify that, we added common oxidation states for these entities? in the final files.
Note that there were some differences in the protein and ligand preparation for docking parametrization experiments and for the final PLINDER benchmark described in Appendix B.
For the AutoDock-CPU and AutoDock-GPU, grid parameter files (GPF) were created with MGLTools, and AutoGrid v.4.2.6? was used for the grid generation.
The AutoDock-CPU docking was launched with AutoDock v.4.2.6? using an MGLTools-generated docking parameter file (DPF).
The AutoDock-GPU v.1.6 docking was compiled from source? with DEVICE = GPU, and NUMWI = 256.
The Vina v.1.2.6? Python API was used to compute affinity maps and perform docking.
Due to the random variability of AutoDock-CPU, AutoDock-GPU, and Vina outputs between the launches, all the results are reported as the average of three independent docking runs.
Glide
We used Schrödinger Suite v.2024-3 command-line interface to prepare the input data and to launch the Glide docking.
The ligand molecules were processed with the LigPrep tool and default parameters. Proteins were prepared with PrepWizard, explicitly retaining all the water molecules in the processed protein structures.
Classical Docking Parametrization
To obtain an optimal docking setup in terms of time/quality trade-off, we launched a range of experiments with varying docking parameters for each of the techniques. In addition, we experimented with the protein pocket representation to understand how the size and shape of a pocket impact final performance. The tested parameters and pocket setups are detailed in Appendix B.
ArtiDock Model
Feature Extraction
The ligands were represented as atom-level molecular graphs. The node features included one-hot encoded atom group and the period from the periodic table. The graph edges represented covalent bonds between heavy atoms and the one-hot encoded covalent bond type.
Protein pocket node features were extracted for each heavy atom and included scalar (one-hot encoded atom group and period indexes, residue and atom names) and vector (distance from a pocket centroid to an atom) components. The graph edges were formed between a node and its 30 closest neighbors. The edge scalar features represented positional embedding calculated by the distance radial basis function.? The vectors between the nodes connected by edges were considered edge vector features.
Architecture
The ArtiDock is based on a proprietary model architecture inspired by the lightweight Trigonometry-Aware Neural Networks.? The model was built with an open-source machine learning framework, PyTorch v.2.5.1,? and a Graph Neural Network (GNN) library, PyTorch Geometric v.2.6.1.?
The ligand and pocket atom-level graphs were encoded using GNNs. The ligand graph was embedded by the graph isomorphism operator? (modified to incorporate edge features in the PyTorch Geometric library). The pocket graph was passed through Geometric Vector Perceptron GNN (GVP-GNN), ?,? which demonstrated superior performance in a range of tasks involving learning from protein structure. GVP-GNN uses both scalar and vector graph features as input. The resulting scalar and vector outputs are invariant and equivariant, respectively, for any rotation or reflection of a protein pocket in 3D Euclidean space.
After the graph encoders, the pocket and ligand node embeddings were combined in the pocket-ligand intermolecular embedding , where p – number of pocket nodes; c – number of ligand nodes; d – embedding size.
The intermolecular embedding was updated by a stack of blocks, each consisting of:
- update by intramolecular distance embedding of pocket and ligand ;
- multihead self-attention;
- nonlinear transition by MLP.
In the end, intermolecular embedding was converted into a pocket-ligand distance matrix by a linear transformation followed by normalization.
Inference Pipeline
The model outputs the pocket-ligand intermolecular distance matrix DM^ pc ^ for a given pocket and small molecule ligand. Thus, an additional distance matrix-to-pose algorithm is needed to convert the matrix into ligand atom coordinates (the actual binding pose). ArtiDock utilizes an algorithm that first infers a 3D point cloud from the distance matrix and then aligns a ligand conformer to the generated cloud. The algorithm details are available in Appendix C.
Physics-Based Optimization
In an effort to maximize the PoseBusters validity of ArtiDock inference pipeline output, we considered two postprocessing strategies for the quick physics-based local minimization of the ligand binding poses:
- Local energy optimization by Vina v.1.2.6? for up to 500 steps using Broyden-Fletcher–Goldfarb–Shanno method.?
- Universal force field (UFF)? minimization for up to 500 iterations using RDKit v.2024.9.5. All pocket atoms were frozen, and only the ligand molecule was allowed to move.
ArtiDock Training
Data Set Preparation
We used the train split of the PLINDER data set as the model training data set. The raw data set contains 309,140 protein–ligand systems from 76,901 PDB IDs, comprising 34,103 unique ligands as identified by their CCD codes. The number of PLPs is 344,033.
After the initial preprocessing and quality filtering (Appendix A), the training data set consisted of 227,066 PLPs from 206,555 systems and 67,888 PDB IDs. The complexes contained 29,440 unique ligand CCD codes. 34% of PLPs contained the ligands annotated as cofactors. 38% of PLPs comprised the ligands that follow Lipinski’s rule of five.?
It is worth noting that, according to PLINDER metadata, 9% of the filtered PLPs involve ligands covalently bound to their targets. However, these were retained in the data set as they met the quality filter criteria.
Since we retained all the entities in the complexes, each PLP may contain other ligands, ions, and water as part of the ligand binding pocket. In the filtered data, 12% of PLPs contained other noncofactor ligands in the pockets; 8% contained cofactors; 16% included at least one ion. 48% of ligands had at least one water bridge interaction.
For the PoseX benchmark, we trained the model on the 275,117 quality-filtered PLINDER PLPs (from all splits) released before 2022-01-01. The data contained 248,910 systems, 84,061 PDB IDs, and 34,460 unique ligand CCD codes.
Data Clustering
Many molecular structures are naturally overrepresented in the PDB as popular drug targets, model proteins, common drug-like ligands, cofactors, etc. For instance, among PLINDER PLPs, 3.6% of binding sites are represented by the HIV envelope glycoprotein GP120. Moreover, nearly 18% of PLPs contain ligands with a nucleoside core (e.g., ATP, UDP, NAD).
To remove the training data redundancy, we clustered the PLPs by ligand structure, protein pocket sequence similarity, and protein–ligand interaction similarity as described in Appendix D. As a result, quality-filtered PLPs were grouped into 40,148 (PLINDER train splt) or 53,205 (PoseX benchmark temporal split) clusters, nearly half of which were singletons.
Training Setup
The learning objective and other training protocol details are available in Appendix D.
Only PLPs that passed the quality filters were used for model training. We also excluded all entries that shared a PDB ID with any structure in the PoseBusters data set to enable potential additional benchmarking of the trained model.
For each training epoch, we randomly sampled only one PLP from each cluster. However, we observed that certain ligand CCD codes remained overrepresented, as they appeared in multiple binding modes across a wide range of proteins (e.g., HSR, ADP, HEM). Therefore, we randomly downsampled these to participate in at most 0.1% of sampled PLPs.
Additionally, we introduced pocket-based data augmentation during the model training as detailed in Appendix D.
Results and Discussion
Classical Docking Setup and Parametrization
Docking Parameters
The AutoDock-CPU experiments (Supplementary Table S1) revealed two parameters with considerable impact on the docking time/quality trade-off: maximum number of energy evaluations and number of independent genetic algorithm runs. Both parameters are directly proportional to the computational cost. Reducing the maximum number of energy evaluations from the default 2.5 million to 350,000 achieved near-peak performance at approximately 7-fold docking speedup. Doubling the default number of runs (10) considerably increased the fraction of predictions with RMSD < 2 Å from 0.42 to 0.46 (RMSD median decreased by 0.2 Å). As a result, we used up to 350,000 energy evaluations and 20 genetic algorithm runs for the final benchmarking of AutoDock-CPU on PLINDER tests.
The AutoDock-GPU experiments (Supplementary Table S2) demonstrated minor changes in both computational cost and accuracy when varying docking parameters. We relate it to the well-adjusted ligand-based heuristics to define the maximum number of energy evaluations and the early stopping algorithm implemented in the docking tool. Thus, AutoDock-GPU was launched with default parameters for further benchmarking.
Similarly to the AutoDock-GPU, Vina uses heuristics to define the maximum number of evaluations. Overriding the parameter with a predetermined value did not seem to improve performance in terms of speed-to-quality ratio (Supplementary Table S3). The docking exhaustiveness, which scales computational cost linearly (in a single process mode), reached near-peak accuracy at the value of 16 (default is 8). Consequently, we used an exhaustiveness of 16 for PLINDER benchmarking.
Among the three Glide precision modes (Appendix B), SP demonstrated the best performance (Supplementary Table S4). It produced 49% of predictions with an RMSD < 2 Å, substantially outperforming the faster HTVS mode (35%) and closely approaching the 51% achieved by the nearly 5 times slower XP mode. However, it is important to note that in terms of lDDT-PLI, a metric for evaluating protein–ligand interaction fidelity, the XP mode considerably outperformed SP, with a median score of 0.65 compared to 0.55. Therefore, XP may be a more suitable option when computational cost is not a limiting factor. When using SP mode, increasing the number of poses to use in postdocking minimization noticeably enhances prediction quality, adding only minor time overhead. Raising the number of poses from the default value of 5 to 80 reduced the median RMSD from 2.1 to 1.8 Å and improved the median lDDT-PLI score from 0.55 to 0.66, which is more accurate than XP mode. Further increase of the parameter value did not impact results. Considering these observations, Glide was launched in SP mode with the number of poses in postdocking minimization equal to 80 for PLINDER benchmarking.
Pocket Representations
Surprisingly, the ligand-aligned bounding box (LABB) representation, which incorporates prior knowledge about the shape of the ligand binding mode, did not outperform the simple cubic box in any of the classical docking methods tested. In fact, the cubic box demonstrated even better performance in AutoDock-CPU and Glide experiments (Figure and Supplementary Tables S1–S4).
Classical docking top-ranked pose RMSD median (Å, smaller is better) and lDDT-PLI score median (larger is better) on PoseBusters data set (N = 306) when a cubic (Cube) box of varying sizes or a ligand-aligned bounding box (LABB) with varying dimension increments is used as pocket input.
Overall, reducing the docking box size either improved or did not affect accuracy across all tools, except for Glide, which achieved peak lDDT-PLI performance using a 20 Å cubic box. This outcome is expected, as smaller pockets centered on the experimentally resolved ligand typically allow for a more exhaustive search in the correct region and reduce the likelihood of sampling highly scored binding modes located far from the true binding site.
However, this highlights an important trade-off between docking precision and the need for accurate pocket definition. In real-world applications, the exact position of the ligand is unknown, necessitating larger pockets to cover all plausible binding poses. To reflect this, we aimed to identify a shape-agnostic, sufficiently large pocket representation that still maintains high predictive accuracy for the final benchmarking. As a result, we selected the 20 Å cubic box as the standardized input for all methods, including ArtiDock, in the PLINDER benchmark evaluation.
PLINDER Benchmark
The ArtiDock model, trained on the PLINDER training set to prevent data leakage, outperforms classical physics-based docking methods on the PLINDER test set (Table). The machine learning approach predicts ligand poses with a median RMSD of 2.0 Å and a median lDDT-PLI score of 0.71, compared to 2.8 and 0.64 Å, respectively, for the best-performing traditional method, Glide.
1: Performance of Docking Methods on the PLINDER Test Dataset (N = 675)
Among the classical docking tools, Glide consistently delivers the best accuracy across most metrics, with Vina following closely behind. AutoDock-CPU, by contrast, shows the weakest performance with a median RMSD of 3.5 Å.
As is evident from Figure, traditional docking methods exhibit considerably wider RMSD distributions than ArtiDock, with higher peak lDDT-PLI values (∼0.9 vs ∼0.75). This indicates that they are occasionally able to predict very accurate poses, but at the same time produce noticeable amounts of low-quality ones. In contrast, ArtiDock produces many fewer low-quality predictions, resulting in better overall accuracy across the benchmark.
Violin plot of docking top-ranked pose RMSD (Å, smaller is better) and lDDT-PLI score (larger is better) distributions on the PLINDER test data set (N = 675). The bars represent the first and third quantiles, solid and dashed lines inside the bars indicate median and mean values, respectively. The shape represents the distribution of the values (kernel density plot).
The PoseBusters quality issues for all benchmarked tools predominantly originate from either too high internal energy of the predicted ligand pose or steric clashes with the pocket atoms (including heteroatoms), as shown in Supplementary Table S8. Considering physics-based approaches, the percentage of outputs with internal energy problems ranges from 5% in AutoDock-CPU/GPU to 1% in Glide. In contrast, Glide fails to avoid steric clashes in up to 7% of samples, while Vina produces clashes in only 1% of cases.
The fraction of ArtiDock predictions that pass the PB-Valid criterion is lower than that of the physics-based tools. However, applying a rapid postprediction optimization using either Vina or RDKit UFF improves this metric at the cost of slightly lowering RMSD precision. Notably, UFF minimization increases the fraction of chemically and geometrically valid poses by 22%, matching Vina PB-Valid performance. Even after this correction, ArtiDock continues to outperform classical docking methods across all other reported metrics.
In terms of computational efficiency, ArtiDock achieves high-throughput performance suitable for large-scale screening. Based on the average runtime per ligand and the hourly cost of the workstations used, we estimate that docking one million small molecules would cost approximately 12) and CPU-dependent Glide ($15). See Appendix E for full hardware specifications and cost estimates.
We were unable to obtain any predictions for some PLPs, as shown in column “N Failed” of Table. Most failed predictions from AutoDock-CPU/GPU and Vina are due to the issues during protein or ligand preprocessing, for example, the inability to process molecules containing selenium or metal atoms. Of the 66 Glide failures, 11 are attributed to similar ligand preprocessing errors, while the remaining failures are primarily due to the absence of acceptable binding modes or the elimination of all poses during grid energy minimization. Failures during UFF minimization typically occur when the protein pocket cannot be handled correctly by RDKit.
It is important to note that we did not penalize any method for failed predictions. All reported results are based on the subset of successfully docked PLPs.
Docking with Heteroatoms
The PLINDER structural data retains nonprotein entities present in the original PDB structures, such as cofactors, ions, and ligand-bound water molecules (Appendix A). We included all heteroatoms within the input docking pocket and tracked docking performance across subsets of the data containing different types of these entities.
In general, the highest docking accuracy was observed in pockets containing organic cofactors (Figure), although this may be influenced by the relatively small sample size of just 23 PLPs.
Docking top-ranked pose RMSD median (Å, smaller is better), lDDT-PLI score median (larger is better), and PB-Valid fraction (larger is better) on the PLINDER test data subsets with no heteroatoms (N = 189), with cofactors (N = 23), with ions (N = 123), or ligand-bound water molecules (N = 403) in a binding pocket.
ArtiDock consistently outperformed classical docking methods across all subsets, showing the greatest improvements in pockets containing ions and water. Specifically, ArtiDock achieved a median RMSD of 2.0 Å in ion-containing pockets (vs 2.8 Å for the best classical method) and 1.9 Å in water-containing pockets (vs 2.6 Å for the best classical method). According to PoseBusters’ quality checks, UFF-based optimization of ArtiDock predictions is on par with or outperforms the most successful classical approach, while Vina-based minimization yielded worse results.
Among physics-based tools, Glide provided the best precision in pockets containing cofactors, water, or protein only. Vina is the most accurate method in ion-containing pockets and shows the most reliable chemical validity across all subsets, with a fraction of PB-Valid results ranging from 0.94 to 0.96 (Figure).
The extended list of metrics is available in Supplementary Table S5.
PLINDER MLSB Benchmark
In contrast to the PLINDER test, which uses bound (holo) protein structures, the PLINDER MLSB subset is designed to evaluate docking tools under more realistic conditions, using apo or predicted protein inputs. This setting leads to a substantial drop in performance, with most methods showing up as a 2-fold increase in median RMSD (Table).
2: Performance of Docking Methods on the PLINDER MLSB Test Dataset (N = 344)
As in the holo benchmark, ArtiDock consistently outperforms all classical methods, achieving the lowest median RMSD (3.4 Å) and highest median lDDT-PLI (0.51). The accuracy score distributions in the PLINDER MLSB test show that ArtiDock predicts a greater number of high-quality poses compared to other methods. In particular, ArtiDock’s lDDT-PLI score distribution peaks around 0.55, whereas Glide peaks at approximately 0.25, with even lower values for the other methods (Figure).
Violin plot of docking top-ranked pose RMSD (Å, smaller is better) and lDDT-PLI score (larger is better) distributions on the PLINDER MLSB test data set (N = 344). The bars represent the first and third quantiles, solid and dashed lines inside the bars indicate median and mean values, respectively (box plot). The shape represents the spread of values for a given method (kernel density plot).
Among the traditional tools, Glide maintains the best overall performance. Notably, Vina shows the steepest decline in accuracy, dropping from the second-best physics-based method in the PLINDER test to the worst performer in the PLINDER MLSB evaluation (Table). We partially attribute poor Vina performance on unbound structures to knowledge-based potentials used in the scoring function? that might cause some overfit to holo pocket representations.
The PB-Valid rate for ArtiDock in the MLSB test is up to 7 percentage points lower than in the holo test, regardless of postprediction minimization. We attribute this decrease to cases where the ligand cannot physically fit into the unbound structure due to significant changes in pocket geometry. While classical docking methods tend to displace the ligand outside the binding box in response to steric clashes, ArtiDock predicts a distance matrix that may not fully account for such spatial incompatibilities but more accurately reflects the ground-truth binding mode.
Because ArtiDock inference pipeline prioritizes preserving predicted protein–ligand interatomic distances, the resulting poses in apo or predicted structures are more likely to exhibit steric issues. This trend is further supported by analysis involving the RMSD between superimposed Cα atoms in holo and apo/predicted pockets As shown in Supplementary Figure S1 and Table S6, the ArtiDock nonminimized PB-Valid fraction decreases from 0.81 in the subset with less than 0.1 to 0.53 Å in the subset with more than 0.5 Å. In contrast, classical methods show no considerable change in validity across these subsets.
When considering accuracy metrics, no correlation between and either RMSD or lDDT-PLI was found (Supplementary Table S7). There was no noticeable decline in accuracy observed up to of 0.5Å. Nevertheless, larger deviations beyond this threshold result in a noticeable drop in RMSD and lDDT-PLI performance for all methods (Supplementary Figure S1 and Table S6). However, the part of PLPs with constitutes only 11.6%. To fully assess the effect of unbound structural deviations on docking performance, additional benchmarks with more structurally distinct apo/predicted proteins are needed.
PoseX Benchmark
The PoseX self-docking (holo) benchmark revealed that ArtiDock performs comparably to the highest-ranked methods (Figure) while demonstrating higher throughput (Appendix E). While AI cofolding approaches yielded substantially lower accuracy than ArtiDock and other top-performing AI docking tools, this discrepancy likely reflects a greater degree of overfitting among AI docking models to the holo protein conformations present in the time-split benchmark, rather than an intrinsic superiority of docking over cofolding strategies in terms of accuracy.
Performance of the docking and cofolding methods on the PoseX self-docking data set (N = 718).
ArtiDock successfully reproduced 73.5% of complexes with an RMSD < 2 Å, closely matching the top-performing method, SurfDock? (77.6%). When structural validity was also required (RMSD < 2 Å and PB-Valid), ArtiDock achieved state-of-the-art performance with 63.9% of successful predictions. A fast postprediction optimization using either Vina or UFF further improved this score up to 66.4%.
As expected, AI cofolding methods exhibited a relative performance gain in the cross-docking (apo) scenario as they do not depend on rigid pocket representation of apo conformations (Figure). For the same reason, the flexible docking approach Glide IFD maintained its accuracy, in contrast to other physics-based methods. Nevertheless, none of the cofolding models substantially outperformed ArtiDock in terms of the fraction of predictions with RMSD < 2 Å.
Performance of the docking and cofolding methods on the PoseX cross-docking data set (N = 1312).
ArtiDock maintained state-of-the-art results, achieving 62.4% of complexes that were both accurate and PoseBusters-valid, followed by Boltz-1x? (60.2%). In contrast, postoptimization with Vina or UFF slightly degraded ArtiDock’s performance in this setting, likely due to the difficulty of physics-based refinements in identifying energetically favorable poses within unbound protein pockets.
Another important observation arises from pocket similarity analysis between crystal structures released before 2022 on RCSB PDB and PoseX test samples. Based on TM-scores and the methodology from the original study,? we identified no correlation between ligand RMSD and pocket TM-scores for ArtiDock (Supplementary Figures S2 and S3). The Pearson correlation coefficients were −0.065 and −0.044 for cross-docking and self-docking, respectively. When stratified by similarity (using a TM-score threshold of 0.7), the average ArtiDock ligand RMSD for “dissimilar” pockets in the self-docking data set was 2.0 Å, compared to 1.8 Å for “similar” pockets. Interestingly, this trend was inverted in the cross-docking scenario, where “dissimilar” pockets yielded a lower average RMSD (1.7 Å) than “similar” (2.0 Å).
This similarity impact contrasts sharply with other AI tools, which tend to show considerable performance gains on test pockets that resemble training data. Notably, in cross-docking scenarios, AI tools (especially cofolding) predominantly exhibit correlations between −0.3 and −0.55. In this regard, ArtiDock aligns more closely with physics-based methods, which are considerably less sensitive to pocket novelty.?
Limitations and Perspectives
The current evaluation of ArtiDock was benchmarked against traditional docking tools on the PLINDER data set. We ensured rigorous evaluation standards, which require retraining ML models on the PLINDER training split. This limits our capacity to extensively benchmark alternative ML docking methods, especially heavyweight diffusion models and cofolding architectures. We invite other developers of ML-based docking tools to utilize and extend the PLINDER benchmark, thus contributing to a more comprehensive assessment of ML docking performance.
A broader range of classical docking tools, such as the commercially available CCDC GOLD? and DOCK,? emerging open-source solutions like PandaDock,? and various Vina-based alternatives, ?,?−? ? should also be evaluated to cover the majority of physics-based docking techniques and to identify areas of improvement for their ML counterparts.
Furthermore, while this study focuses on rigid-receptor docking optimized for speed, we acknowledge the importance of receptor flexibility. Future iterations of ArtiDock should explore handling flexible binding sites, drawing inspiration from established flexible docking approaches like MedusaDock ?,? and Glide IFD,? to further extend benchmark and improve performance on unbound structures.
Chemical validity remains a concern in most ML-based docking approaches, including ArtiDock. Despite its high predictive accuracy, ArtiDock outputs often require additional physics-based postprocessing steps to ensure chemical validity. Future efforts should be directed toward improving ArtiDock’s internal mechanisms to inherently produce chemically valid ligand poses, thus reducing or eliminating reliance on external physics-based adjustments.
Since there is no scoring function, which is assessed during the pose prediction, ArtiDock is not suitable for ranking alternative poses of the same ligand or comparing different ligands in terms of the binding strength, like conventional docking techniques do. The poses generated by ArtiDock are optimal for the given ligand and given conformation of the binding pocket, but their scoring should be performed externally by either algorithmic scoring functions (Vina/UFF postprocessing steps automatically provide energy estimates) or ML scoring models.
In a complete virtual screening workflow, ArtiDock functions as the docking engine. While it does not perform exhaustive conformational sampling, its regression-based approach directly targets the bioactive state, offering a speed advantage over sampling-based methods.
We acknowledge that scoring is critical for virtual screening, and many recent approaches address this by predicting binding affinity directly from structure (e.g., NeuralDock,? PLANET,? 3DProtDTA?) or even from sequence alone (e.g., Yuel,? DeepDTA,? MONN?). To streamline and enhance predictive accuracy in high-throughput virtual screening scenarios, the development of a dedicated, ML-based scoring function tightly integrated within ArtiDock is necessary. Such an integrated scoring approach would likely provide enhanced applicability in real-world drug discovery applications.
Conclusions
In this study, we presented ArtiDock, an ML-based approach for protein–ligand docking, which addresses the shortcomings of classical docking methodologies and is optimized to high-throughput virtual screening scenarios. The evaluation, conducted on the recently developed PLINDER benchmark data set, demonstrates that ArtiDock predicts the ligand poses considerably more accurately in comparison to traditional physics-based docking techniques, reaching a median RMSD of 2.0 Å and median lDDT-PLI of 0.71 compared to 2.8 and 0.64 Å, respectively, for the best-performing classical method, Glide.
ArtiDock’s lightweight model architecture and optimized inference pipeline allow it to reach the docking throughput, which is on par or better than the fastest conventional tools. This positions ArtiDock as a promising candidate for high-throughput virtual screening scenarios, overcoming one of the primary obstacles that have previously hindered widespread adoption of ML docking approaches.
ArtiDock is consistently more accurate than traditional techniques in various real-world docking scenarios, including pockets containing cofactors, ions, and water molecules. Notably, the pockets containing ions and water molecules showed especially pronounced accuracy gain, emphasizing ArtiDock’s robustness and adaptability to complex biologically relevant binding modes.
ArtiDock maintains a distinct advantage on the PLINDER MLSB data set that includes especially challenging ligand-unbound protein structures, which deviate from the baseline ligand-bound conformations.
The integration of a quick, physics-based postprocessing step substantially elevates the fraction of chemically valid poses to levels comparable to traditional docking approaches without compromising predictive accuracy or computational complexity substantially.
In the time-split PoseX benchmark, ArtiDock delivered competitive accuracy compared to generally slower top-performing ML tools and achieved state-of-the-art results when PoseBusters’ validity was accounted for.
Overall, ArtiDock represents an advancement toward the much anticipated paradigm shift in docking toward ML-based tools, which effectively combines high throughput, robust predictive performance, and adaptability to realistic docking scenarios.
The future work should be primarily related to the inclusion of other classical and ML docking approaches to the current benchmark, the improvement of chemical and geometrical validity of ArtiDock output, and the implementation of ArtiDock’s native scoring function.
Supplementary Material
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ding F.Yin S.Dokholyan N. V.Rapid Flexible Docking Using a Stochastic Rotamer Library of Ligands J. Chem. Inf. Model.2010501623163210.1021/ci 100218 t 20712341 PMC 2947618 · doi ↗ · pubmed ↗
- 2Wang J.Dokholyan N. V.Medusa Dock 2.0: Efficient and Accurate Protein–Ligand Docking With Constraints J. Chem. Inf. Model.2019592509251510.1021/acs.jcim.8b 0090530946779 PMC 6597311 · doi ↗ · pubmed ↗
- 3Sherman W.Day T.Jacobson M. P.Friesner R. A.Farid R.Novel Procedure for Modeling Ligand/Receptor Induced Fit Effects J. Med. Chem.20064953455310.1021/jm 050540 c 16420040 · doi ↗ · pubmed ↗
- 4Trott O.Olson A. J.Auto Dock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading J. Comput. Chem.20103145546110.1002/jcc.2133419499576 PMC 3041641 · doi ↗ · pubmed ↗
- 5Eberhardt J.Santos-Martins D.Tillack A. F.Forli S.Auto Dock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings J. Chem. Inf. Model.2021613891389810.1021/acs.jcim.1c 0020334278794 PMC 10683950 · doi ↗ · pubmed ↗
- 6Li X.Li Y.Cheng T.Liu Z.Wang R.Evaluation of the performance of four molecular docking programs on a diverse set of protein-ligand complexes J. Comput. Chem.2010312109212510.1002/jcc.2149820127741 · doi ↗ · pubmed ↗
- 7Ghasemi, J. B. ; Abdolmaleki, A. ; Shiri, F. Molecular Docking Challenges and Limitations. In Pharmaceutical Sciences: Breakthroughs in Research and Practice; IGI Global, 2017; pp 770–794.
- 8Méndez-Lucio O.Ahmad M.del Rio-Chanona E. A.Wegner J. K.A geometric deep learning approach to predict binding conformations of bioactive molecules Nat. Mach. Intell.202131033103910.1038/s 42256-021-00409-9 · doi ↗
