GUARD: A Safe Reinforcement Learning Benchmark
Weiye Zhao, Yifan Sun, Feihan Li, Rui Chen, Ruixuan Liu, Tianhao Wei, Changliu Liu

TL;DR
GUARD is a comprehensive benchmark designed to evaluate and compare safe reinforcement learning algorithms across diverse tasks and constraints, facilitating progress in safety-critical applications.
Contribution
The paper introduces GUARD, a versatile and comprehensive benchmark for safe RL, including diverse algorithms, tasks, and safety constraints, with self-contained implementations.
Findings
Benchmark enables fair comparison of safe RL algorithms.
Baseline results established for various task settings.
GUARD's flexibility supports future safe RL research.
Abstract
Due to their trial-and-error nature, it is typically challenging to apply RL algorithms to safety-critical real-world applications, such as autonomous driving, human-robot interaction, robot manipulation, etc., where such errors are not tolerable. Recently, safe RL (i.e., constrained RL) has emerged rapidly in the literature, in which the agents explore the environment while satisfying constraints. Due to the diversity of algorithms and tasks, it remains difficult to compare existing safe RL algorithms. To fill that gap, we introduce GUARD, a Generalized Unified SAfe Reinforcement Learning Development Benchmark. GUARD has several advantages compared to existing benchmarks. First, GUARD is a generalized benchmark with a wide variety of RL agents, tasks, and safety constraint specifications. Second, GUARD comprehensively covers state-of-the-art safe RL algorithms with self-contained…
Peer Reviews
Decision · ICLR 2024 Conference Withdrawn Submission
(1) The paper is very clear and the benchmark is described incredibly thoroughly; I only had some minor confusion, which can be addressed in the questions. (2) This benchmark is clearly a direct competitor with the likes of Safety Gym and newer benchmarks like Safety-Gymnasium (Ji et al. 2023) and Omnisafe (Ji et al. 2023). However, it distinguishes itself by having 8 completely different robot configurations and some new task and constraint configurations. (3) The problem is well motivated. And
(1) Only model-free algorithms are implemented in this benchmark. Model-based approaches to safe RL have grown in interest recently since they exhibit better sample complexity; it would be nice to see perhaps two popular model-based methods from the literature available here. (2) Little backwards compatibility with Safety Gym. Clearly Safety Gym (Ray et al. 2019) has inspired this work; it would be nice if the 3 robots and 3 environment configurations from Safety Gym were captured by GUARD. (3) The num
1. A unified benchmark is really needed for safe RL, especially when `safety gym` is not maintained anymore.
2. The authors presented several different tasks, which appear to be very customizable.
3. The benchmark comes with implemented baseline algorithms.
1. The tasks do not seem to be well-tuned or well-designed depending on the viewpoint. The target cost (which is equal to zero) is not achieved in many experiments, which suggests that the safe algorithms did not learn to be “safe”. I think designing the tasks and tuning the algorithms would be one of the main contributions of this work, but it seems to be lacking.
2. It would be great to provide some details on implementation (apologies if I missed them). For example, it would be good to know
- Generalization: GUARD provides a wide variety of agents, tasks, and safety constraint specifications, making it a versatile benchmark for testing safe RL algorithms. It accommodates diverse real-world scenarios, ensuring that research is not limited to specific domains.
- Unification: GUARD promotes a unified platform for evaluating safe RL algorithms. By maintaining consistency in experiment setups, it facilitates reliable performance comparisons across different algorithms and controlled env
- In Section "4.2 UNCONSTRAINED RL", it is mentioned that TRPO is state-of-the-art, which is clearly not correct given recent improvements to RL policies like Agent57 [1] or MuZero [2].
- Closely related to the above point, except for USL, all other considered algorithms are from at least 2 years ago, which contradicts what is stated in the abstract: "GUARD comprehensively covers state-of-the-art safe RL algorithms". I would strongly suggest including more recent algorithms.
- A few
The paper's strength lies in the creation of GUARD, a pioneering benchmark in the field of safe reinforcement learning, which significantly surpasses existing benchmarks. GUARD's unique contributions include an extensive and generalized framework encompassing 11 different types of agents, 7 distinct robot locomotion tasks, and 8 safety constraint specifications. Furthermore, it offers a unified platform with comprehensive coverage of 8 state-of-the-art safe RL algorithms, all implemented with a
- Although I agree with the authors that there are two groups of safe RL methods, i.e., hierarchical and end-to-end safe RL, I personally think separating the methods by theoretical guarantee would be a better choice. On-policy versus off-policy could also be a choice.
- It would be great if the authors could briefly explain why the methods with theoretical guarantees for constraint satisfaction cannot satisfy the constraint in the early phase of training.
- The paper covers 8 state-of-the-ar
Taxonomy
Topics: Reinforcement Learning in Robotics · Autonomous Vehicle Technology and Safety · Adversarial Robustness in Machine Learning
