Deep Confident Steps to New Pockets: Strategies for Docking Generalization
Gabriele Corso, Arthur Deng, Benjamin Fry, Nicholas Polizzi, Regina Barzilay, Tommi Jaakkola

TL;DR
This paper introduces DockGen, a new benchmark for evaluating docking models' generalization, and proposes Confidence Bootstrapping, a training paradigm that enhances the ability of ML-based docking methods to generalize to unseen protein classes.
Contribution
The paper develops the DockGen benchmark, analyzes scaling laws for ML-based docking models, and introduces Confidence Bootstrapping to improve generalization to unseen proteins.
Findings
Scaling data and model size improves generalization.
Synthetic data strategies significantly enhance performance.
Confidence Bootstrapping boosts docking accuracy on unseen proteins.
Abstract
Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the…
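The abstract describes Confidence Bootstrapping as relying on the interaction between a diffusion model and a confidence model. The toy sketch below illustrates only the general idea of such a loop: a generator proposes samples, a scorer ranks them, and the generator is nudged toward its own highest-scoring samples. Everything in it is an assumption made for illustration and is not the authors' implementation: poses are single numbers, the "diffusion model" is a Gaussian with a learnable mean, and `confidence`, `bootstrap`, and `pocket_center` are hypothetical names.

```python
import random


def confidence(pose, pocket_center=5.0):
    # Hypothetical stand-in for a learned confidence model:
    # the score is higher when the pose lies closer to the pocket.
    return -abs(pose - pocket_center)


def bootstrap(mean=0.0, std=2.0, rounds=20, samples=64, keep=8, lr=0.5, seed=0):
    # Toy generator: a Gaussian whose mean plays the role of the
    # trainable parameters of a generative docking model (assumption).
    rng = random.Random(seed)
    for _ in range(rounds):
        # 1. Sample candidate poses from the current generator.
        poses = [rng.gauss(mean, std) for _ in range(samples)]
        # 2. Rank the candidates with the confidence scorer, keep the best.
        top = sorted(poses, key=confidence, reverse=True)[:keep]
        # 3. Move the generator toward its own high-confidence samples.
        target = sum(top) / len(top)
        mean += lr * (target - mean)
    return mean


if __name__ == "__main__":
    print(f"generator mean after bootstrapping: {bootstrap():.2f}")
```

No ground-truth pose is used inside the loop. The only training signal is the scorer's ranking of the generator's own samples, which is why a scheme of this shape can, in principle, be applied to targets without known bound structures.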
Taxonomy
Topics: Protein Structure and Dynamics · Machine Learning in Bioinformatics · Computational Drug Discovery Methods
Methods: Sparse Evolutionary Training · Diffusion
