Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding

Daniel Bethell; Simos Gerasimou; Radu Calinescu; Calum Imrie

arXiv:2405.18180 · cs.AI · August 27, 2025 · 1 citation

TL;DR

This paper presents ADVICE, a novel adaptive shielding method for safe reinforcement learning in black-box environments, significantly reducing safety violations during training while maintaining competitive rewards.

Contribution

Introduces ADVICE, a post-shielding approach that identifies safe and unsafe features to enhance safety in black-box RL environments without prior domain knowledge.

Findings

- Reduces safety violations by approximately 50% during training.
- Maintains competitive reward performance compared to existing safe RL methods.
- Effective in unknown, black-box environments without prior knowledge.

Abstract

Empowering safe exploration of reinforcement learning (RL) agents during training is a critical challenge towards their deployment in many real-world scenarios. When prior knowledge of the domain or task is unavailable, training RL agents in unknown, black-box environments presents an even greater safety risk. We introduce ADVICE (Adaptive Shielding with a Contrastive Autoencoder), a novel post-shielding technique that distinguishes safe and unsafe features of state-action pairs during training, and uses this knowledge to protect the RL agent from executing actions that are likely to yield hazardous outcomes. Our comprehensive experimental evaluation against state-of-the-art safe RL exploration techniques shows that ADVICE significantly reduces safety violations (by approximately 50%) during training, with a competitive outcome reward compared to other techniques.
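
The post-shielding idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the authors' code: the encoder is a fixed linear map standing in for the contrastive autoencoder, safety is judged by a nearest-neighbour vote in that latent space, and the names `encode`, `is_safe`, `shield`, `n_neighbors`, and `min_safe` are hypothetical. The `min_safe` parameter only loosely mirrors the safety threshold K mentioned in the reviews; the paper's exact definition may differ.

```python
import numpy as np

def encode(state, action, projection):
    # Stand-in encoder (assumption): a fixed linear map of the concatenated
    # state-action pair. ADVICE uses a contrastive autoencoder instead.
    return np.concatenate([state, action]) @ projection

def is_safe(z, latents, labels, n_neighbors=5, min_safe=3):
    # Judge a latent point safe if at least `min_safe` of its `n_neighbors`
    # nearest labelled points are safe (label 1 = safe, 0 = unsafe).
    dists = np.linalg.norm(latents - z, axis=1)
    nearest = np.argsort(dists)[:n_neighbors]
    return int(labels[nearest].sum()) >= min_safe

def shield(state, proposed, candidates, latents, labels, projection):
    # Post-shield: keep the agent's proposed action if it is judged safe,
    # otherwise return the first candidate judged safe. If none is judged
    # safe, fall back to the proposed action.
    for action in [proposed, *candidates]:
        if is_safe(encode(state, action, projection), latents, labels):
            return action
    return proposed

# Toy labelled data (hypothetical): 1-D state, 1-D action, and a pair counts
# as unsafe when state + action exceeds 1.
grid = np.linspace(-1.0, 2.0, 31)
pairs = np.array([[s, a] for s in grid for a in grid])
labels = (pairs.sum(axis=1) <= 1.0).astype(int)
projection = np.eye(2)          # identity "encoder" for the toy example
latents = pairs @ projection

state = np.array([0.5])
proposed = np.array([1.4])      # state + action = 1.9, unsafe in the toy rule
candidates = [np.array([0.9]), np.array([-0.5])]
chosen = shield(state, proposed, candidates, latents, labels, projection)
print(chosen)
```

In this toy setup the shield rejects the proposed action and the first candidate, then settles on the second candidate. Raising `min_safe` makes the shield more conservative, which is one plausible reading of why reviewers flag sensitivity to the safety threshold.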

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

- The paper is well written. The description of the method is precise.
- The proposed method is original. It is the first method to use contrastive learning in a safe RL setting.
- The empirical evaluation indicates the approach can potentially increase the safety of RL algorithms.

Weaknesses

- The problem formulation is incomplete. The paper does not define the safety properties expected from the RL agent.
- Lack of theoretical results. This paper provides only empirical results to support its claims.
- The results are presented in a convoluted way. In particular, the results disregard the safety violations of the agent in the first 1000 episodes. The reason for presenting the results in this way is unclear.
- The presentation of the DDPG-Lag as a constrained RL algorithm is imprecise.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- ADVICE introduces a new way to handle safety in RL using a contrastive autoencoder for distinguishing safe and unsafe actions.
- Despite prioritizing safety, ADVICE maintains competitive performance in terms of rewards compared to other methods.
- ADVICE does not require prior knowledge about the environment, making it suitable for black-box scenarios.

Weaknesses

- ADVICE requires an initial period to gather data before it can be fully effective, which could be a disadvantage in some scenarios.
- The paper suggests that ADVICE might struggle with dynamic environments and could benefit from incorporating temporal context, which would add additional computational load.
- The performance of ADVICE is sensitive to hyperparameters like the safety threshold K, which might require careful tuning.

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- This paper proposes a new shield-based method for safe exploration, which is an important problem for application of RL.
- Overall, the paper is clearly written (e.g., fig. 1) and easy to follow.
- The idea of classifying the safety of state-action in latent space is novel.

Weaknesses

- My biggest concern is the effectiveness of the neighbor model in step 1, which determines whether a new state-action pair is safe or not. However, this key component of the proposed shielding method is trained on data collected in an initial unshielded stage (line 161). During execution, the policy will be updated and will differ from the initial policy, which will lead to a severe distribution shift in the state-action pairs. Therefore, it's very questionable whether the neighbor model can still disti

Code & Models

Repositories

team-daniel/advice
tf · Official

Taxonomy

Topics: Smart Grid Security and Resilience