Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning

Marco Jiralerspong; Esther Derman; Danilo Vucetic; Nikolay Malkin; Bilun Sun; Tianyu Zhang; Pierre-Luc Bacon; Gauthier Gidel

arXiv:2506.17007·cs.LG·October 13, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a new RL-based method with a unified operator and robustness perspective to improve the quality and diversity of candidate generation in large search spaces, such as proteins or molecules.

Contribution

It proposes a novel unified operator for regularized RL that better targets peakier distributions and introduces a robust RL framework for filtering candidates.

Findings

01

Trajectory general mellowmax (TGM) outperforms baselines in synthetic tasks

02

TGM identifies higher quality candidates in real-world applications

03

The method enhances diversity and robustness of candidate generation

Abstract

A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer…
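For orientation, here is a minimal sketch of two standard soft operators from the regularized RL literature: log-sum-exp and mellowmax. It is not the paper's unified operator, whose definition does not appear in this summary. The function names and example values are illustrative assumptions.

```python
import math

def logsumexp(values, beta=1.0):
    """Soft maximum: (1/beta) * log(sum(exp(beta * v)))."""
    m = max(values)  # subtract the max before exponentiating for stability
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta

def mellowmax(values, omega=1.0):
    """Mellowmax: (1/omega) * log(mean(exp(omega * v)))."""
    return logsumexp(values, omega) - math.log(len(values)) / omega

# Hypothetical action values: one good action among three poor ones.
q_values = [1.0, 0.0, 0.0, 0.0]

for omega in (0.01, 1.0, 10.0, 100.0):
    print(f"omega={omega:>6}: mellowmax={mellowmax(q_values, omega):.4f}, "
          f"logsumexp={logsumexp(q_values, omega):.4f}")
```

As omega grows, mellowmax approaches the hard maximum. As omega approaches zero, it approaches the mean of the values. Log-sum-exp always sits at or above the maximum.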

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper addresses an important problem: developing a reinforcement learning method that can generate diverse responses while also producing high-scoring candidates useful for biological and chemical discovery.
2. The paper is well written and generally easy to follow.

Weaknesses

1. It seems that several important baseline comparisons are missing (see questions for details).
2. I'm not sure it's appropriate to compare GFN with TGM, since GFN involves training a Q-network and therefore requires significantly more computation. The compute budget should be made fair for a valid comparison (see questions for details).
3. I also find the motivating example somewhat unclear or unconvincing (see questions for details).
4. Some of the figures are hard to understand.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- The proposed methods are novel and theoretically grounded, with a clear motivation.
- The numerical experiments are comprehensive, showing the relevance of TGM.

Weaknesses

- The writing is pretty challenging to follow.
- The motivating examples of "high reward path dominated by many low reward paths" seem to be an issue with the designed target distribution rather than a problem with the learning algorithm. For a 0-1 reward, if we reduce the temperature so that the distribution is proportional to exp(r/gamma) with gamma -> 0, the valid target examples are only going to be concentrated on positive reward ones, and the cases presented in the motivating examples no lon
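The temperature argument in this review can be checked numerically. The sketch below assumes a toy setting that comes from neither the paper nor the review: a finite set of objects with 0-1 rewards and a target distribution proportional to exp(r/gamma). The function name and the object counts are hypothetical.

```python
import math

def positive_mass(num_positive, num_zero, gamma):
    """Probability mass on reward-1 objects when p(x) is proportional to exp(r(x) / gamma)."""
    # Dividing numerator and denominator by exp(1 / gamma) avoids overflow at small gamma.
    zero_weight = num_zero * math.exp(-1.0 / gamma)
    return num_positive / (num_positive + zero_weight)

# Hypothetical search space: 1 rewarded object among 1000 unrewarded ones.
for gamma in (1.0, 0.5, 0.1, 0.01):
    print(f"gamma={gamma:>5}: mass on positive-reward objects = "
          f"{positive_mass(1, 1000, gamma):.4f}")
```

In this toy setting the mass on the rewarded object moves from well under 1% at gamma = 1 to nearly all of the distribution at gamma = 0.01. This shows only the reviewer's claim about the target distribution, not how any learning algorithm behaves.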

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This paper is original in unifying multiple soft RL operators through the proposed general mellowmax framework. The proposed trajectory general mellowmax is significant in bridging GFlowNets and robust RL under a common theoretical perspective. The paper contains rigorous mathematical derivations, a well-motivated algorithmic design, and thorough experiments across both synthetic and real-world biological design tasks. The problem motivation, operator formulation, and empirical findings are clear

Weaknesses

The mathematical notation can feel a bit dense at times, and I occasionally got lost on what the various key parameters are meant to govern. It would be helpful to have a better illustration or explanation of the roles of alpha, q, and omega in TGM. This is attempted in Figure 3, but the reader has to go to other portions of the paper to understand what the axes describe. In general, I feel that the presentation and clarity of the paper could be improved. I also fee

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Machine Learning in Materials Science