ROSARL: Reward-Only Safe Reinforcement Learning

Geraud Nangue Tasse; Tamlin Love; Mark Nemecek; Steven James; Benjamin Rosman

arXiv:2306.00035·cs.LG·June 2, 2023·1 cites

TL;DR

This paper introduces a reward-only approach to safe reinforcement learning. It determines an upper bound on the reward at unsafe states such that the resulting optimal policies minimize the probability of reaching those states. The bound is derived from the environment's controllability and diameter, so safe policies can be learned without hand-designed penalties.

Contribution

The paper proposes the Minmax penalty for safe RL and provides a model-free algorithm that learns this bound while the task itself is being learned, improving safety in high-dimensional environments.
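To make the idea of learning a penalty during task learning concrete, here is a minimal sketch of one plausible estimator. It is an illustration under stated assumptions, not the paper's algorithm as confirmed by this page: the class `PenaltyEstimator`, its method names, and the rule "penalty = smallest value seen minus largest value seen" are all hypothetical. The sketch tracks the extremes of observed rewards and value estimates and substitutes the resulting gap for the reward on unsafe transitions.

```python
class PenaltyEstimator:
    """Hypothetical running estimate of a penalty for unsafe transitions."""

    def __init__(self):
        self.r_min = float("inf")
        self.r_max = float("-inf")
        self.v_min = float("inf")
        self.v_max = float("-inf")

    def update(self, reward, value_estimate):
        """Fold one observation into the bounds and return the penalty."""
        self.r_min = min(self.r_min, reward)
        self.r_max = max(self.r_max, reward)
        self.v_min = min(self.v_min, self.r_min, value_estimate)
        self.v_max = max(self.v_max, self.r_max, value_estimate)
        # Assumed rule: the gap between the extremes, which is never positive.
        return self.v_min - self.v_max

    def shape(self, reward, value_estimate, unsafe):
        """Return the reward the learner should train on for this step."""
        penalty = self.update(reward, value_estimate)
        return penalty if unsafe else reward


estimator = PenaltyEstimator()
safe_step = estimator.shape(reward=1.0, value_estimate=0.5, unsafe=False)
unsafe_step = estimator.shape(reward=-1.0, value_estimate=3.0, unsafe=True)
```

In this sketch the value estimate would come from whatever value-based learner is being trained, so the estimator only wraps the reward signal and leaves the learner unchanged.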

Findings

01 · The Minmax penalty effectively bounds unsafe state reachability.

02 · The algorithm learns safe policies in complex continuous control tasks.

03 · Using Minmax penalty accelerates safe policy convergence.

Abstract

An important problem in reinforcement learning is designing agents that learn to solve tasks safely in an environment. A common solution is for a human expert to define either a penalty in the reward function or a cost to be minimised when reaching unsafe states. However, this is non-trivial, since too small a penalty may lead to agents that reach unsafe states, while too large a penalty increases the time to convergence. Additionally, the difficulty in designing reward or cost functions can increase with the complexity of the problem. Hence, for a given environment with a given set of unsafe states, we are interested in finding the upper bound of rewards at unsafe states whose optimal policies minimise the probability of reaching those unsafe states, irrespective of task rewards. We refer to this exact upper bound as the "Minmax penalty", and show that it can be obtained by taking into…
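The claim that too small a penalty can leave unsafe behaviour optimal can be checked on a toy problem. The sketch below is not from the paper: the five-state MDP, the slip probability, and the step reward are all assumed for illustration. From the start state, a shortcut reaches the goal in one step but slips into the unsafe state with probability 0.2, while a long route reaches the goal safely in three steps. Undiscounted value iteration then shows which action is greedy for a given penalty.

```python
START, MID1, MID2, GOAL, UNSAFE = range(5)
SHORT, LONG = 0, 1
SLIP = 0.2          # assumed chance that the shortcut ends in the unsafe state
STEP_REWARD = -1.0  # assumed task reward for every ordinary step


def transitions(state, action):
    """Return (probability, next_state) pairs for the toy MDP."""
    if state == START and action == SHORT:
        return [(1.0 - SLIP, GOAL), (SLIP, UNSAFE)]
    if state == START:
        return [(1.0, MID1)]
    if state == MID1:
        return [(1.0, MID2)]
    return [(1.0, GOAL)]  # from MID2 both actions reach the goal


def q_value(values, state, action, penalty):
    """Expected undiscounted return of taking `action` in `state`."""
    total = 0.0
    for prob, nxt in transitions(state, action):
        reward = penalty if nxt == UNSAFE else STEP_REWARD
        total += prob * (reward + values[nxt])
    return total


def greedy_start_action(penalty, sweeps=50):
    """Run value iteration, then return the greedy action at START."""
    values = [0.0] * 5  # GOAL and UNSAFE are terminal and stay at 0
    for _ in range(sweeps):
        for state in (START, MID1, MID2):
            values[state] = max(
                q_value(values, state, a, penalty) for a in (SHORT, LONG)
            )
    return max((SHORT, LONG), key=lambda a: q_value(values, START, a, penalty))


small_penalty_choice = greedy_start_action(penalty=-1.0)
large_penalty_choice = greedy_start_action(penalty=-20.0)
```

With these assumed numbers the shortcut is worth 0.2 × penalty − 0.8 and the long route is worth −3, so the risky shortcut stays optimal until the penalty drops below −11. The sketch only illustrates the "too small" side of the trade-off; it does not model the slower convergence that the abstract attributes to overly large penalties.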

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper's strengths lie in providing an alternative to unstable policy optimization methods by introducing a penalty term that eliminates the need for such approaches. The paper is well-presented and generally sound, albeit with some flaws. The analysis in a lower-dimensional environment, as well as the comparison between the performance of the practical algorithm and the environment where the method's assumption holds, is quite helpful.

Weaknesses

The paper derives its penalty term using the concepts of diameter and solvability, which require knowledge of the dynamics, whereas the practical implementation of the method does not use knowledge of the dynamics. The empirical experiments are on the weaker side. The method underperforms Lagrangian TRPO in task performance. The paper also evaluates the method at only a single threshold, and it does not compare the approach with the PID Lagrangian method, which is SOTA

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Reframing the episodic safety problem from a complex "constrained optimization" framework (CMDPs) back to a "reward design" problem is an insightful perspective.
2. The proposed practical algorithm is simple and easy to implement. It does not require manual tuning of hyperparameters. It can be used as a "plug-in" with any off-the-shelf, value-based RL algorithm, offering strong generality.
3. The method shows excellent performance in the Safety Gym experiments, particularly under high-noise…

Weaknesses

1. The entire theoretical framework is explicitly built on "undiscounted stochastic shortest path" (SSP) MDPs. However, the core experiments used to validate the algorithm are conducted in "discounted," continuous-control, non-SSP environments. This makes the connection between the theoretical derivations and the experimental results weak.
2. The practical algorithm completely omits the solvability factor C from the theoretical bound. The authors claim the adaptive nature of the algorithm "imp…

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The proposed safe reinforcement learning algorithm has some theoretical significance.

Weaknesses

1. The method has not been compared with existing reachability methods, which adopt the idea of minimax optimization.
2. The proposed method cannot guarantee absolute safety of the policy in theory or in practice, which is crucial for safe reinforcement learning.
3. The baseline algorithms used for comparison are relatively outdated.

Code & Models

Repositories

geraudnt/rosarl · Official

Taxonomy

Topics · Reinforcement Learning in Robotics