Skill-based Safe Reinforcement Learning with Risk Planning

Hanping Zhang; Yuhong Guo

arXiv:2505.01619·cs.LG·May 6, 2025

Open Access 3 Reviews

TL;DR

This paper introduces a novel Safe Skill Planning approach that leverages offline demonstration data and risk prediction to improve safety and efficiency in reinforcement learning for robotics.

Contribution

The paper proposes a two-stage Safe Skill Planning method combining PU learning and risk-aware planning to enhance safe RL with offline data and online adaptation.
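For readers unfamiliar with PU (positive-unlabeled) learning, the sketch below shows one common objective of this family, the non-negative PU risk estimator with a sigmoid surrogate loss. It is not taken from the paper: the summary here does not say which PU formulation SSkP uses, and the choice of treating labeled risky skills as the positive class, the function names, and the `prior` parameter (the assumed fraction of positives among unlabeled samples) are all assumptions made for illustration.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss; label is +1 (risky) or -1 (safe)."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk computed from raw classifier scores.

    pos_scores: scores of samples known to be positive (assumed: risky skills)
    unl_scores: scores of unlabeled samples (mix of risky and safe)
    prior:      assumed fraction of positives among the unlabeled samples
    """
    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # The unlabeled term over-counts positives; subtract their share and
    # clamp at zero so the estimated negative-class risk cannot go negative.
    neg_part = unl_as_neg - prior * pos_as_neg
    return prior * pos_as_pos + max(0.0, neg_part)
```

A predictor trained by minimizing a quantity like this needs only confirmed positives plus unlabeled data, which is why such objectives suit demonstration datasets where safe examples are not explicitly marked.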

Findings

01. Outperforms previous safe RL methods in benchmark environments

02. Effectively integrates offline demonstration data for risk prediction

03. Achieves safer and more efficient policy learning

Abstract

Safe Reinforcement Learning (Safe RL) aims to ensure safety when an RL agent conducts learning by interacting with real-world environments where improper actions can induce high costs or lead to severe consequences. In this paper, we propose a novel Safe Skill Planning (SSkP) approach to enhance effective safe RL by exploiting auxiliary offline demonstration data. SSkP involves a two-stage process. First, we employ PU learning to learn a skill risk predictor from the offline demonstration data. Then, based on the learned skill risk predictor, we develop a novel risk planning process to enhance online safe RL and learn a risk-averse safe policy efficiently through interactions with the online RL environment, while simultaneously adapting the skill risk predictor to the environment. We conduct experiments in several benchmark robotic simulation environments. The experimental results…
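The abstract does not spell out how the risk planning step works, so the following is only a minimal sketch of one plausible reading: sample several candidate skills, score each with a learned risk predictor, and execute the least risky one. The names `plan_skill`, `propose_skill`, and `predict_risk`, the toy scalar skill space, and the toy risk function are hypothetical and are not the authors' implementation.

```python
import random


def plan_skill(state, propose_skill, predict_risk, num_candidates=16, rng=None):
    """Pick the candidate skill with the lowest predicted risk.

    propose_skill(state, rng) -> skill       (hypothetical policy sampler)
    predict_risk(state, skill) -> float      (hypothetical risk predictor)
    """
    if rng is None:
        rng = random.Random(0)
    candidates = [propose_skill(state, rng) for _ in range(num_candidates)]
    risks = [predict_risk(state, skill) for skill in candidates]
    best = min(range(num_candidates), key=risks.__getitem__)
    return candidates[best], risks[best]


# Toy stand-ins: a skill is a scalar in [-1, 1], and larger magnitudes
# are treated as riskier. Both are assumptions for illustration only.
def toy_propose(state, rng):
    return rng.uniform(-1.0, 1.0)


def toy_risk(state, skill):
    return min(1.0, abs(skill))


if __name__ == "__main__":
    skill, risk = plan_skill(0.0, toy_propose, toy_risk, num_candidates=32)
    print(f"chosen skill={skill:.3f}, predicted risk={risk:.3f}")
```

In a setup like this, the number of candidates trades computation for caution: more samples make it more likely that a low-risk skill is found, at the cost of more predictor evaluations per decision.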

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The paper investigates an interesting setting of using offline data to enable RL for online exploration in constrained settings. This topic is crucial for enabling RL to explore in constrained real-world scenarios. In this setting, the paper explores using skills in safe RL, which is a relatively unexplored combination. The paper provides a clear description of the algorithm, which aids in understanding the advantages and disadvantages of the proposed method.

Weaknesses

- The impact of using skills needs to be ablated. It is unclear whether using skills in this context is useful. Is it beneficial to only act on each H step in a constrained setting? The authors should ablate using the presented planning and policy learning combination on low-level actions.
- The motivation behind using PU learning to learn the safety classifier is unclear. The authors use the classifier as a cost-to-go function, which would typically be learned as a safety critic using approxi

Reviewer 02 · Rating 3 · Confidence 3

Strengths

- The paper introduces the idea of using offline data for training skills and a risk predictor.
- The paper is well-organized and easy to read.

Weaknesses

# Major Weaknesses
- The contributions are relatively limited.
- Each proposed module (skills [1,2], risk predictors [3, 4, 5]) is from existing methods.
- If there were any novel training techniques, the author should have highlighted them, but it seems there is nothing new.
- The motivation for the proposed method is ambiguous, and in particular, it is unclear which parts of the proposed method improve upon existing methods (lines 44–48).
- Also, there is no theoretical an

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper proposes a new method, SSkP, that combines skill learning with risk planning for safe RL.
- The approach uses a two-stage process that first learns from offline data and then applies it to online environments, which is efficient and reduces potential damage to physical environments.
- The risk planning process is a simple yet effective method for generating safer skill decisions, enhancing safe exploration and learning.
- The method adapts the skill risk predictor to online environme

Weaknesses

- The effectiveness of SSkP relies heavily on the quality and quantity of offline demonstration data, which may not always be available or reliable.
- The two-stage process and the integration of multiple components (skill model, risk predictor, risk planning) might make the approach more complex to implement and understand.
- The paper primarily focuses on robotic simulation environments, and it's unclear how well SSkP would generalize to other types of environments or real-world applications.

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Software Engineering Methodologies · Safety Systems Engineering in Autonomy