Multi-domain Distribution Learning for De Novo Drug Design

Arne Schneuing; Ilia Igashov; Adrian W. Dobbelstein; Thomas Castiglione; Michael Bronstein; Bruno Correia

arXiv:2508.17815·cs.LG·August 26, 2025

TL;DR

DrugFlow is a novel generative model for structure-based drug design that combines flow matching and Markov bridges, enabling accurate, uncertainty-aware sampling of protein-ligand conformations with improved distribution learning.

Contribution

The paper introduces DrugFlow, integrating continuous flow matching with Markov bridges, and extends it to explore protein conformations, achieving state-of-the-art performance in multi-domain drug design.

Findings

01 · State-of-the-art performance in learning the chemical, geometric, and physical aspects of three-dimensional protein-ligand data.

02 · An uncertainty estimate that detects out-of-distribution samples.

03 · Exploration of the protein's conformational landscape by jointly sampling side-chain angles and molecules.

Abstract

We introduce DrugFlow, a generative model for structure-based drug design that integrates continuous flow matching with discrete Markov bridges, demonstrating state-of-the-art performance in learning chemical, geometric, and physical aspects of three-dimensional protein-ligand data. We endow DrugFlow with an uncertainty estimate that is able to detect out-of-distribution samples. To further enhance the sampling process towards distribution regions with desirable metric values, we propose a joint preference alignment scheme applicable to both flow matching and Markov bridge frameworks. Furthermore, we extend our model to also explore the conformational landscape of the protein by jointly sampling side chain angles and molecules.
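To make the continuous half of this description concrete, here is a minimal NumPy sketch of flow matching on a linear path, with a per-sample Gaussian error model standing in for an uncertainty estimate. It is an illustration under stated assumptions, not DrugFlow's implementation: the linear interpolant, the isotropic variance, the Euler sampler, and every function name (`interpolate`, `target_velocity`, `gaussian_nll`, `euler_sample`) are hypothetical choices made for this example. The discrete Markov bridge component and the preference alignment scheme are not covered.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Time derivative of the linear path above; it does not depend on t."""
    return x1 - x0

def gaussian_nll(pred_v, log_var, true_v):
    """Per-sample Gaussian negative log-likelihood, up to an additive constant.

    Assumes (for illustration only) an isotropic error with one predicted
    log-variance per sample. A large predicted variance down-weights the
    squared error but is penalised by the log-variance term.
    """
    d = true_v.shape[-1]
    sq_err = np.sum((pred_v - true_v) ** 2, axis=-1)
    return 0.5 * (sq_err / np.exp(log_var) + d * log_var)

def euler_sample(velocity_fn, x0, n_steps=100):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In a training loop, a network would take `interpolate(x0, x1, t)` and `t` as input and output both a velocity and a log-variance, with `gaussian_nll` as the loss. At sampling time, the learned velocity would be passed to `euler_sample`, and the predicted variance could serve as a rough out-of-distribution signal.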

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The paper is articulated clearly and concisely.
- Figures and tables effectively present complex data and comparisons, enhancing accessibility for readers.
- The methodology is detailed thoroughly, supporting reproducibility.

Weaknesses

The primary concern lies in the paper's technical soundness:

1. The treatment of the pocket is inadequately detailed: it is unclear whether the pocket is generated jointly with the molecule or used as context.
2. In Section 2.1, the uncertainty estimation involves several ambiguities:
   - In line #133, the assumption of the error being normally distributed is neither evident nor justified.
   - In line #143, $\dot{x}_t$ is inaccurately referred to as a ground truth vector field; it shou

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The model considers side-chain flexibility, which is critical in ligand docking and design as receptors are mostly non-rigid. To the best of my knowledge, side-chain flexibility has been overlooked in previous SBDD methods until this work.
- This model provides an estimate of uncertainty, which is important in the molecular modeling area and can increase the practicality of the method. Uncertainty estimation has been a common practice in structure prediction settings, but it has als

Weaknesses

- Does the evaluation presented in Section 3.1 consider side-chain flexibility? It seems that DrugFlow and FlexFlow are separate variants and only FlexFlow considers side-chain flexibility.
- If Section 3.1 does not model side-chain flexibility, why not? Did the authors consider jointly sampling both ligand structures and side-chain torsional angles?

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This paper had a lot of strong positives but also some strong negatives. Starting with the positives:

- Good knowledge of the field: unlike many ML papers in this area, this work has no statements about drug discovery that seemed to portray an embarrassing lack of domain knowledge on behalf of the authors. I also agree with the assessment that many works train models for distribution matching and then evaluate them for optimization, which does not make sense.
- The end-to-end uncertainty estimat

Weaknesses

In my opinion, the biggest weaknesses of this paper all come from the experiments. I've organized them under the following headings.

### You might not be measuring the right things

Essentially all metrics in the paper are about how well the distribution of molecules generated by the model matches the training distribution. However:

- Only _marginal_ (1D) distributions seem to be measured, rather than _joint_ distributions of properties (i.e. does the joint distribution of SAscore and logP look
