Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning
Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel

TL;DR
This paper introduces a new RL-based method with a unified operator and robustness perspective to improve the quality and diversity of candidate generation in large search spaces, such as proteins or molecules.
Contribution
It proposes a novel unified operator for regularized RL that better targets peakier distributions and introduces a robust RL framework for filtering candidates.
Findings
TGM (trajectory general mellowmax) outperforms baselines on synthetic tasks
TGM identifies higher quality candidates in real-world applications
The method enhances diversity and robustness of candidate generation
Abstract
A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper addresses an important problem: developing a reinforcement learning method that can generate diverse responses while also producing high-scoring candidates useful for biological and chemical discovery.
2. The paper is well written and generally easy to follow.
1. It seems that several important baseline comparisons are missing (see questions for details).
2. I’m not sure it’s appropriate to compare GFN with TGM, since GFN involves training a Q-network and therefore requires significantly more computation. The compute budget should be made fair for a valid comparison (see questions for details).
3. I also find the motivating example somewhat unclear or unconvincing (see questions for details).
4. Some of the figures are hard to understand.
- The proposed methods are novel and theoretically grounded, with a clear motivation.
- The numerical experiments are comprehensive, showing the relevance of TGM.
- The writing is pretty challenging to follow.
- The motivating examples of "high reward path dominated by many low reward paths" seem to be an issue with the designed target distribution rather than a problem with the learning algorithm. For a 0-1 reward, if we reduce the temperature so that the distribution is proportional to exp(r/gamma) with gamma -> 0, the valid target examples will concentrate only on positive-reward ones, and the cases presented in the motivating examples no lon
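The temperature argument in this review can be checked numerically. The following is a minimal sketch, not taken from the paper: the function name `tempered_target` and the toy search space (one object with reward 1 among 999 objects with reward 0) are assumptions made purely for illustration.

```python
import math

def tempered_target(rewards, gamma):
    """Return p(x) proportional to exp(r(x) / gamma) over a finite set."""
    scaled = [r / gamma for r in rewards]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy space: 1 object with reward 1, 999 objects with reward 0.
rewards = [1.0] + [0.0] * 999

for gamma in (1.0, 0.1, 0.01):
    probs = tempered_target(rewards, gamma)
    print(f"gamma={gamma}: mass on the reward-1 object = {probs[0]:.4f}")
```

At gamma = 1 the single high-reward object receives well under 1% of the probability mass because the low-reward objects outnumber it, while at gamma = 0.01 nearly all mass sits on it. This only illustrates the reviewer's point about the target distribution; it says nothing about how hard such a sharpened target is to learn.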
This paper is original in unifying multiple soft RL operators through the proposed general mellowmax framework. The proposed trajectory general mellowmax is significant in bridging GFlowNets and robust RL under a common theoretical perspective. The paper contains rigorous mathematical derivations, a well-motivated algorithmic design, and thorough experiments across both synthetic and real-world biological design tasks. The problem motivation, operator formulation, and empirical findings are clear
The mathematical notation can feel a bit dense at times and I found myself getting occasionally lost on what the various key parameters are meant to govern. It would be helpful to have a better illustration or explanation on the role of alpha, q, and omega in TGM - this is attempted in Figure 3, but requires the reader to go to other portions of the paper to understand what the axes are describing. So in general, I feel that the presentation and clarity of the paper could be improved. I also fee
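For readers unfamiliar with the operators the reviews refer to, here is a small sketch of two standard soft operators: the log-sum-exp soft maximum common in entropy-regularized RL, and the standard mellowmax. These are the textbook definitions, not the paper's general mellowmax or TGM, whose exact form is not given on this page. Only an inverse-temperature parameter named `omega` is modeled; the roles of alpha and q are not described here and are left out. The function names and the toy values are assumptions for illustration.

```python
import math

def log_sum_exp(values, omega):
    """Soft maximum: (1/omega) * log(sum_i exp(omega * v_i)), for omega > 0."""
    m = max(values)  # factor out the max for numerical stability
    return m + math.log(sum(math.exp(omega * (v - m)) for v in values)) / omega

def mellowmax(values, omega):
    """Standard mellowmax: (1/omega) * log(mean_i exp(omega * v_i)), for omega > 0."""
    return log_sum_exp(values, omega) - math.log(len(values)) / omega

# Hypothetical values: one high-value option among three low-value ones.
values = [0.0, 0.0, 0.0, 1.0]

for omega in (0.001, 1.0, 100.0):
    print(omega, log_sum_exp(values, omega), mellowmax(values, omega))
```

As omega grows, mellowmax approaches the maximum of the values, and as omega shrinks toward zero it approaches their mean, so a larger omega corresponds to a peakier preference for the best option. Log-sum-exp never falls below the maximum, whereas mellowmax never exceeds it.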
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Machine Learning in Materials Science
