ROSARL: Reward-Only Safe Reinforcement Learning
Geraud Nangue Tasse, Tamlin Love, Mark Nemecek, Steven James, Benjamin Rosman

TL;DR
This paper introduces a reward-only safe reinforcement learning method that determines an upper bound on the reward at unsafe states, derived from the environment's controllability and diameter, such that optimal policies minimize the probability of reaching unsafe states. This enables safe policy learning without explicit penalty design.
Contribution
It proposes the concept of Minmax penalty for safe RL, providing a model-free algorithm to learn this bound during task learning, improving safety in high-dimensional environments.
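The idea of learning a penalty while the task is being learned can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the class name `PenaltyEstimator`, its methods, and the specific update rule (penalty taken as the gap between the smallest and largest reward or value estimate seen so far) are all hypothetical, since this summary does not spell out the update.

```python
class PenaltyEstimator:
    """Running penalty estimate for unsafe states (hypothetical sketch)."""

    def __init__(self):
        # Bounds on observed rewards and on value estimates seen so far.
        self.r_min = float("inf")
        self.r_max = float("-inf")
        self.v_min = float("inf")
        self.v_max = float("-inf")

    def update(self, reward, value_estimate):
        """Fold one observed reward and one value estimate into the bounds."""
        self.r_min = min(self.r_min, reward)
        self.r_max = max(self.r_max, reward)
        self.v_min = min(self.v_min, self.r_min, value_estimate)
        self.v_max = max(self.v_max, self.r_max, value_estimate)
        return self.penalty()

    def penalty(self):
        # Assumed rule: the penalty is at least as low as any return
        # difference observed so far, so it is never positive.
        return self.v_min - self.v_max

    def shape(self, reward, value_estimate, unsafe):
        """Return the reward to learn from: the penalty if the state is unsafe."""
        penalty = self.update(reward, value_estimate)
        return penalty if unsafe else reward
```

In use, a value-based learner would call `shape(reward, value_estimate, unsafe)` on each transition and train on the returned value, so no penalty needs to be hand-tuned in advance.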
Findings
The Minmax penalty effectively bounds unsafe state reachability.
The algorithm learns safe policies in complex continuous control tasks.
Using Minmax penalty accelerates safe policy convergence.
Abstract
An important problem in reinforcement learning is designing agents that learn to solve tasks safely in an environment. A common solution is for a human expert to define either a penalty in the reward function or a cost to be minimised when reaching unsafe states. However, this is non-trivial, since too small a penalty may lead to agents that reach unsafe states, while too large a penalty increases the time to convergence. Additionally, the difficulty in designing reward or cost functions can increase with the complexity of the problem. Hence, for a given environment with a given set of unsafe states, we are interested in finding the upper bound of rewards at unsafe states whose optimal policies minimise the probability of reaching those unsafe states, irrespective of task rewards. We refer to this exact upper bound as the "Minmax penalty", and show that it can be obtained by taking into…
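The trade-off described in the abstract can be made concrete with a toy calculation. The sketch below is an assumption-laden illustration, not taken from the paper: it imagines an undiscounted task with a risky shortcut (one move that ends in the unsafe state with probability `p_unsafe`) and a safe detour (never unsafe, but paying `step_reward` on each of `detour_steps` extra steps). All function and parameter names are hypothetical.

```python
def expected_returns(penalty, p_unsafe, goal_reward, step_reward, detour_steps):
    """Return (risky, safe) expected undiscounted returns in the toy task."""
    # Risky shortcut: reaches the goal unless it lands in the unsafe state.
    risky = (1.0 - p_unsafe) * goal_reward + p_unsafe * penalty
    # Safe detour: always reaches the goal, but pays for the extra steps.
    safe = goal_reward + detour_steps * step_reward
    return risky, safe


def penalty_threshold(p_unsafe, goal_reward, step_reward, detour_steps):
    """Penalty value below which the safe detour has the higher return."""
    safe = goal_reward + detour_steps * step_reward
    return (safe - (1.0 - p_unsafe) * goal_reward) / p_unsafe


if __name__ == "__main__":
    threshold = penalty_threshold(0.1, 1.0, -0.1, 5)
    print("threshold:", threshold)
    for penalty in (-1.0, -10.0):
        risky, safe = expected_returns(penalty, 0.1, 1.0, -0.1, 5)
        print(penalty, "safe preferred" if safe > risky else "risky preferred")
```

With these made-up numbers the threshold is about -4: a penalty of -1 leaves the risky shortcut optimal, while a penalty of -10 makes the safe detour optimal. The threshold depends on the task rewards and the length of the detour, which is why a hand-picked penalty can easily be too small for one task and needlessly large for another.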
Peer Reviews
Decision·Submitted to ICLR 2026
The paper's strengths lie in providing an alternative to unstable policy optimization methods by introducing a penalty term that eliminates the need for such approaches. The paper is well-presented and generally sound, albeit with some flaws. The analysis in a lower-dimensional environment, as well as the comparison between the performance of the practical algorithm and the environment where the method's assumption holds, is quite helpful.
The paper derives its penalty term using the concepts of diameter and solvability, which require knowledge of the dynamics, whereas the practical implementation of the method does not require knowledge of the dynamics. The empirical experiments are on the weaker side. The method underperforms Lagrangian TRPO in task performance. Also, the paper evaluates the method with only a single threshold; further, the method is not compared with the PID Lagrangian method, which is SOTA.
1. Reframing the episodic safety problem from a complex "constrained optimization" framework (CMDPs) back to a "reward design" problem is an insightful perspective.
2. The proposed practical algorithm is simple and easy to implement. It does not require manual tuning of hyperparameters. It can be used as a "plug-in" with any off-the-shelf, value-based RL algorithm, offering strong generality.
3. The method shows excellent performance in the Safety Gym experiments, particularly under high-noise
1. The entire theoretical framework is explicitly built on "undiscounted stochastic shortest path" (SSP) MDPs. However, the core experiments used to validate the algorithm are conducted in "discounted," continuous-control, non-SSP environments. This makes the connection between the theoretical derivations and the experimental results weak.
2. The practical algorithm completely omits the solvability factor C from the theoretical bound. The authors claim the adaptive nature of the algorithm "imp
The proposed safe reinforcement learning algorithm has some theoretical significance.
1. It has not been compared with existing reachability methods, which adopt the idea of minimax optimization.
2. The proposed method cannot guarantee absolute safety of the policy in theory or practice, which is crucial for safe reinforcement learning.
3. The baseline used for comparison is relatively outdated.
Taxonomy
Topics: Reinforcement Learning in Robotics
