Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking
Thomas Le Menestrel, Manuel Rivas

TL;DR
Smiles2Dock is a large-scale, multi-task dataset of more than 25 million protein-ligand binding scores, built to support benchmarking and development of ML-based molecular docking algorithms.
Contribution
The paper introduces Smiles2Dock, a comprehensive dataset and framework for ML-based docking, including a novel Transformer architecture for score prediction.
Findings
The dataset contains more than 25 million binding scores, obtained by docking 1.7 million ChEMBL ligands against 15 AlphaFold proteins.
Benchmark results demonstrate the effectiveness of the new Transformer-based docking score predictor.
Dataset and code are publicly available for community use and further research.
Abstract
Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically…
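The framework described in the abstract pairs a pocket predictor (P2Rank) with a docking engine (AutoDock Vina). The following is a minimal sketch of how the two could be connected, not the authors' implementation. It assumes P2Rank's usual `*_predictions.csv` layout with `rank`, `center_x`, `center_y`, and `center_z` columns, and it uses a hypothetical cubic search box of 20 Å, since the excerpt does not state the box size or other docking settings. The helper names `top_pocket_center` and `vina_command` and all file names are illustrative.

```python
import csv
import io


def top_pocket_center(predictions_csv: str) -> tuple[float, float, float]:
    """Return the (x, y, z) center of the best-ranked pocket.

    Assumes a P2Rank-style predictions CSV whose header includes
    rank, center_x, center_y, center_z (headers may be space-padded).
    """
    reader = csv.DictReader(io.StringIO(predictions_csv), skipinitialspace=True)
    rows = [
        {k.strip(): v.strip() for k, v in row.items() if k is not None and v is not None}
        for row in reader
    ]
    if not rows:
        raise ValueError("no pockets found in predictions CSV")
    best = min(rows, key=lambda r: int(r["rank"]))
    return (float(best["center_x"]), float(best["center_y"]), float(best["center_z"]))


def vina_command(receptor, ligand, center, box_size=20.0, out="docked.pdbqt"):
    """Build an AutoDock Vina command line centered on a predicted pocket.

    box_size is a hypothetical default (in angstroms), not a value from the paper.
    """
    cx, cy, cz = center
    return [
        "vina",
        "--receptor", receptor,
        "--ligand", ligand,
        "--center_x", str(cx),
        "--center_y", str(cy),
        "--center_z", str(cz),
        "--size_x", str(box_size),
        "--size_y", str(box_size),
        "--size_z", str(box_size),
        "--out", out,
    ]


# Illustrative P2Rank-style output (made-up values, reduced set of columns).
SAMPLE = "\n".join([
    "name     ,  rank,   score, center_x, center_y, center_z",
    "pocket1  ,     1,   12.50,     10.5,    -3.25,      7.0",
    "pocket2  ,     2,    4.10,      1.0,      2.0,      3.0",
])

center = top_pocket_center(SAMPLE)
command = vina_command("protein.pdbqt", "ligand.pdbqt", center)
print(" ".join(command))
```

The command list can be passed to `subprocess.run` once Vina is installed and the receptor and ligand have been converted to PDBQT. Repeating this for every ligand-protein pair (1.7 million ligands against 15 proteins) gives a dataset of the size reported above.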
Taxonomy
Topics: Computational Drug Discovery Methods · Chemical Synthesis and Analysis · Click Chemistry and Applications
Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention
