Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

Xuan Li; Zhanke Zhou; Zongze Li; Jiangchao Yao; Yu Rong; Lu Zhang; Bo Han

arXiv:2603.05900·cs.LG·March 9, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces RePO, a novel reference-guided policy optimization method that enhances molecular optimization with LLMs by balancing exploration and exploitation, outperforming existing fine-tuning and reinforcement learning approaches.

Contribution

RePO is a new optimization approach that learns from reference molecules without trajectory data, combining RL exploration with supervised guidance for improved molecular optimization.

Findings

1. RePO outperforms SFT and RLVR baselines on benchmarks.
2. RePO achieves higher success rate and similarity scores.
3. RePO generalizes better to unseen instruction styles.

Abstract

Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model's lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO…
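To make the sparse-feedback point concrete, here is a minimal sketch of a verifiable reward under a similarity constraint. It is an illustration under stated assumptions, not the paper's reward: the binary reward, the `min_sim` threshold of 0.6, the "higher property value is better" convention, and the representation of fingerprints as sets of integer bit indices are all hypothetical choices made for this example.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def verifiable_reward(
    src_fp: Set[int],
    cand_fp: Set[int],
    src_prop: float,
    cand_prop: float,
    min_sim: float = 0.6,  # hypothetical threshold, not taken from the paper
) -> float:
    """Return 1.0 only if the candidate improves the property AND stays similar.

    Assumes a higher property value is better; both conditions must hold,
    which is why most exploratory samples would score 0.0.
    """
    improved = cand_prop > src_prop
    similar = tanimoto(src_fp, cand_fp) >= min_sim
    return 1.0 if (improved and similar) else 0.0
```

Because the reward is a conjunction, a candidate that improves the property but drifts too far from the source molecule earns nothing, and so does a near-copy that fails to improve. If a model rarely samples molecules satisfying both conditions, most rollouts return zero, which is one way to read the "sparse feedback" described in the abstract.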

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper is very well-written and easy to understand. I like how the problem is first formulated and clearly defined in Section 2, followed by a discussion of specific limitations in Section 3. The proposed method is then presented clearly and succinctly in Section 4, making it easy for readers to follow and grasp the details.
- DePO is an elegant fusion of RL and supervised signals, guiding exploration with domain demonstrations while maintaining reasoning depth. It demonstrates constraint

Weaknesses

- In the DePO scheme, the LLM output is first parsed into reasoning tokens and the generated final answer. Then, the generated final answer is replaced with the gold-standard final answer. This way, the reasoning tokens are preserved, and gradient masking for the intermediate reasoning steps excludes these tokens from parameter updates during optimization. The authors claim that this approach prevents the LLM from learning potentially erroneous reasoning patterns, but it is not clear why that is
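The parse, replace, and mask steps the reviewer describes can be sketched as follows. This is a hedged illustration only: the `<answer>…</answer>` tag format, the whitespace tokenizer, and the function names `splice_reference` and `masked_nll` are assumptions introduced for this example and are not taken from the paper.

```python
import re
from typing import List, Tuple

# Hypothetical output format: free-form reasoning, then <answer>SMILES</answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def splice_reference(completion: str, reference: str) -> Tuple[List[str], List[int]]:
    """Keep the sampled reasoning, swap in the reference answer, build a loss mask."""
    match = ANSWER_RE.search(completion)
    # If no answer tag is found, treat the whole completion as reasoning.
    reasoning = completion[: match.start()] if match else completion
    # Whitespace splitting stands in for a real tokenizer.
    reasoning_tokens = reasoning.split()
    answer_tokens = ["<answer>", reference, "</answer>"]
    tokens = reasoning_tokens + answer_tokens
    # 0 = excluded from the supervised loss (reasoning), 1 = included (answer).
    mask = [0] * len(reasoning_tokens) + [1] * len(answer_tokens)
    return tokens, mask


def masked_nll(token_log_probs: List[float], mask: List[int]) -> float:
    """Mean negative log-likelihood over unmasked positions only."""
    kept = [-lp for lp, m in zip(token_log_probs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

In this sketch the reasoning tokens still condition the answer (they stay in the sequence), but they contribute nothing to the supervised loss, so only the reference-answer positions would drive that part of the update.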

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper introduces a novel demonstration-guided reinforcement learning paradigm (DePO) that effectively bridges the gap between language-model reasoning and domain-constrained scientific optimization. It presents comprehensive experiments across multiple molecular optimization benchmarks, providing strong empirical evidence for the framework’s effectiveness. The significance of this work lies in extending LLM reasoning beyond text and mathematics to molecular design, demonstrating that demonst

Weaknesses

1. The discussion of related work is limited. Several recent GPT-based molecular optimization studies are neither cited nor compared as baselines, making it difficult to contextualize DePO’s contributions within existing literature.
2. The novelty appears somewhat incremental, as the method primarily adds guiding exploration and a regularization term to the GRPO objective. The introduced regularization is heuristic in nature, and the paper does not provide sufficient discussion on how the hyper
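For readers unfamiliar with the GRPO objective mentioned in this review, the sketch below shows its group-relative advantage (each reward normalized by the mean and standard deviation of its sampling group) and a generic way to add a weighted extra term. The `combined_loss` function and its `weight` parameter are hypothetical placeholders; the exact form and weighting of the paper's added term are not given in this excerpt.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_loss(policy_loss: float, guidance_loss: float, weight: float = 0.1) -> float:
    """Policy-gradient loss plus a weighted auxiliary term.

    `weight` is a hypothetical hyperparameter used only for illustration.
    """
    return policy_loss + weight * guidance_loss
```

Note what happens when every sample in a group receives the same reward, for example all zeros: every advantage is zero, so the policy-gradient term carries no learning signal for that group. An auxiliary supervised term is one way to keep a nonzero gradient in that situation.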

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. Novel combination of demonstration and RL: Clear motivation and principled integration of supervised guidance into RL objective.
2. Strong empirical gains: Up to 13% improvement over SFT and GRPO on TOMG-Bench and MuMOInstruct, with convincing generalization and ablations.
3. Well-written and interpretable: The framework (gradient masking, demonstration term) is intuitive and illustrated clearly with chemical reasoning examples.

Weaknesses

1. Incremental contribution: Conceptually close to existing RLHF/RLVR + imitation learning hybrids; novelty may feel limited.
2. Limited scope: Only tested on molecular optimization; no evidence DePO generalizes to other scientific or reasoning domains.
3. Weak theoretical insight: Lacks formal analysis of convergence or policy improvement guarantees under demonstration bias.

Taxonomy

Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Machine Learning and Data Classification