Guiding Diffusion Models with Reinforcement Learning for Stable Molecule Generation

Zhijian Zhou; Junyi An; Zongkai Liu; Yunfei Shi; Xuan Zhang; Fenglei Cao; Chao Qu; Yuan Qi

arXiv:2508.16521·cs.LG·August 25, 2025

4 Reviews

TL;DR

This paper introduces RLPF, a reinforcement learning framework that guides diffusion models with physical feedback to generate more stable and physically accurate 3D molecular structures.

Contribution

It extends diffusion models with reinforcement learning, using force-field-based rewards to improve physical realism in molecule generation.

Findings

01 · RLPF outperforms existing methods in molecular stability.

02 · Incorporating physics-based rewards enhances structural accuracy.

03 · The approach is validated on QM9 and GEOM-drug datasets.

Abstract

Generating physically realistic 3D molecular structures remains a core challenge in molecular generative modeling. While diffusion models equipped with equivariant neural networks have made progress in capturing molecular geometries, they often struggle to produce equilibrium structures that adhere to physical principles such as force field consistency. To bridge this gap, we propose Reinforcement Learning with Physical Feedback (RLPF), a novel framework that extends Denoising Diffusion Policy Optimization to 3D molecular generation. RLPF formulates the task as a Markov decision process and applies proximal policy optimization to fine-tune equivariant diffusion models. Crucially, RLPF introduces reward functions derived from force-field evaluations, providing direct physical feedback to guide the generation toward energetically stable and physically meaningful structures. Experiments on…
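As a rough illustration of the fine-tuning objective the abstract describes, the sketch below computes a PPO-style clipped surrogate over the denoising steps of sampled trajectories, in the spirit of DDPO. The function name, the clip range, and the batch-wise reward normalization are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, rewards, clip=0.2):
    """PPO clipped surrogate averaged over trajectories and denoising steps.

    logp_new, logp_old: arrays of shape (batch, steps) holding per-step
        log-probabilities under the current policy and the sampling policy.
    rewards: array of shape (batch,), one terminal reward per molecule
        (for example a negated force RMS; an assumption for this sketch).
    """
    # Normalize rewards within the batch to obtain advantages (assumed choice).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv[:, None]  # broadcast the terminal advantage to every step
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    # PPO maximizes the elementwise minimum of the two terms.
    return np.minimum(unclipped, clipped).mean()
```

In a training loop, the negative of this value would be minimized with respect to the diffusion model's parameters; here it is only evaluated on fixed arrays.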

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The paper addresses a critical and practical problem in 3D molecular generation: the physical instability of generated structures. The demonstrated improvements in downstream efficiency (e.g., a 44% time reduction in rejection sampling (Table 8) and 36% fewer optimization steps (Table 11)) are significant and highly valuable for practitioners. The work successfully demonstrates that adapting an existing RL framework (DDPO) with a domain-specific, physics-based reward is a viable and effective s

Weaknesses

The core RL algorithm is a direct application of the existing DDPO framework. The proposed "Size-Invariant Log-Likelihood Estimation" (Sec 4.5) appears to be a standard masking technique used in sequence processing. Furthermore, using physical properties as reward signals is a known concept in other areas of molecular generation (e.g., with autoregressive models). The primary contribution is the combination of these existing ideas, rather than a novel RL algorithm or reward formulation. The pap
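For readers unfamiliar with the masking technique the reviewer alludes to, here is a minimal sketch of a log-likelihood averaged over real atoms only, so that padded positions do not contribute and molecules of different sizes are comparable. This is an assumption about the general idea, not the paper's Sec. 4.5; whether the paper averages or sums is not stated in this excerpt, and the function name is hypothetical.

```python
import numpy as np

def masked_mean_logp(per_atom_logp, mask):
    """Average per-atom log-probabilities over real (unpadded) atoms only.

    per_atom_logp: (batch, max_atoms) log-probability contribution per atom.
    mask: (batch, max_atoms), 1 for real atoms and 0 for padding.
    """
    mask = mask.astype(per_atom_logp.dtype)
    total = (per_atom_logp * mask).sum(axis=1)
    n_atoms = np.maximum(mask.sum(axis=1), 1.0)  # guard against empty rows
    return total / n_atoms
```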

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. The diffusion trajectory is cast as an MDP and trained stably in a PPO style with an explicit reward function (force RMS) aimed at "structural physical feasibility". This approach is reasonable and well aligned with the domain requirement of stable conformations.
2. It covers multiple backbones and various ablations, including reward type, sampling steps, pruning threshold, etc., and includes a control group for continued training and RL, demonstrating a strong awa

Weaknesses

1. There is a lack of systematic comparisons with non-RL energy/force-field-guided sampling methods (such as energy guidance, score distillation with energy, and consistency comparisons with molecular force-field post-processing); conditional generation uses a predictor ω to evaluate target properties, which carries the risk of evaluator coupling or bias.
2. Equation (12) defines the RMS of the force vector (the unit should be eV/Å or kJ/mol/Å). The text repeatedly refers to it as "RMSD of fo
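To make the distinction raised in point 2 concrete, the sketch below contrasts a force RMS (a force quantity, in eV/Å or kJ/mol/Å) with a coordinate RMSD (a length, in Å). The exact form of the paper's Equation (12) is not reproduced in this excerpt, so the per-atom averaging used here is an assumption, and both function names are hypothetical.

```python
import numpy as np

def force_rms(forces):
    """Root mean square of per-atom force magnitudes.

    forces: (n_atoms, 3) array, e.g. in eV/Å. The result has the same unit.
    """
    forces = np.asarray(forces, dtype=float)
    return float(np.sqrt((forces ** 2).sum(axis=1).mean()))

def coord_rmsd(x, y):
    """RMSD between two already-aligned conformations of shape (n_atoms, 3).

    Inputs are coordinates in Å, so the result is a length in Å.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

The two functions share the same arithmetic but take different inputs: one measures residual forces on a single structure, the other measures displacement between two structures.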

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- I am not aware of works specifically focused on fine-tuning diffusion models for improving molecular stability (although I am not an expert in this domain), which to my knowledge is an important task from a chemical standpoint.
- The paper is easy to follow and sufficiently formal to be clear.
- Experimental evaluations encompass a wide range of pre-trained models as baselines.

Weaknesses

- (Main concern) Although the paper shows an interesting application, I believe there is almost no core methodological innovation. Concretely, I would not regard RLPF as a 'novel framework', as stated in the abstract and throughout the paper, but rather as a specific choice of reward function plus lower-level engineering/implementation details. The algorithmic machinery employed is already very well established in this field, having been introduced ~3 years ago with hundreds of wor

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- This paper is well-written and well-structured.
- The authors incorporate physics-informed reinforcement learning to guide diffusion models to generate physically plausible outputs.
- The authors validate the effectiveness of the proposed approach by showing improved performance across different datasets and in combination with different pretrained diffusion models.

Weaknesses

- The paper offers limited methodological novelty. It primarily builds on prior work [1], which adapts DDPO to optimize pre-trained diffusion models for various downstream objectives. This paper simply applies the same methodology to the molecular domain, using a different objective function defined over generated 3D molecular structures.
- The impact of the masking mechanism introduced in Equation 13 on performance remains unclear. Could the authors provide additional experiments to supp
