Robust Probabilistic Shielding for Safe Offline Reinforcement Learning
Maris F. L. Galesloot, Thomas Rhemrev, Nils Jansen

TL;DR
This paper combines safe policy improvement (SPI) with shielding to provide both performance and safety guarantees in offline reinforcement learning. In experiments, shielded SPI outperforms unshielded methods.
Contribution
It extends shielding to offline RL, providing high-probability safety guarantees during policy improvement using only the dataset and knowledge of safe and unsafe states.
Findings
Shielded SPI outperforms unshielded methods in experiments.
The approach improves both average and worst-case performance.
Effectiveness is especially notable in low-data regimes.
Abstract
In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded…
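To make the idea of a dataset-derived shield concrete, here is a minimal sketch for a small tabular problem. It is not the authors' algorithm. Several things are assumptions made for illustration: the function names `worst_case_risk` and `build_shield`, the Hoeffding-style interval around each estimated transition probability, the finite `horizon`, the risk `threshold`, and the fallback to the least risky actions when no action meets the threshold. The sketch estimates transitions from `(state, action, next_state)` tuples, computes a pessimistic probability of reaching an unsafe state, and keeps only the actions whose pessimistic risk stays below the threshold.

```python
import math
from collections import defaultdict


def worst_case_risk(successor_counts, risk, delta):
    """Pessimistic expected risk of one (state, action) pair.

    Assumption: each transition probability lies within a Hoeffding-style
    radius of its empirical estimate. Within those intervals, probability
    mass is shifted toward the riskiest successors.
    """
    n = sum(successor_counts.values())
    if n == 0:
        return 1.0  # pair never observed in the dataset: assume the worst
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    states = range(len(risk))
    lower = [max(0.0, successor_counts.get(s, 0) / n - eps) for s in states]
    upper = [min(1.0, successor_counts.get(s, 0) / n + eps) for s in states]
    probs = lower[:]
    budget = 1.0 - sum(lower)  # mass still to distribute
    for s in sorted(states, key=lambda i: risk[i], reverse=True):
        extra = min(budget, upper[s] - lower[s])
        probs[s] += extra
        budget -= extra
    return sum(p * r for p, r in zip(probs, risk))


def build_shield(dataset, n_states, n_actions, unsafe,
                 horizon=3, threshold=0.2, delta=0.05):
    """Return {state: [allowed actions]} from (s, a, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in dataset:
        counts[(s, a)][s_next] += 1

    # risk[s]: pessimistic probability of reaching an unsafe state from s
    # within the remaining steps, assuming the safest action is taken later.
    risk = [1.0 if s in unsafe else 0.0 for s in range(n_states)]
    q = {}
    for _ in range(horizon):
        q = {(s, a): worst_case_risk(counts[(s, a)], risk, delta)
             for s in range(n_states) for a in range(n_actions)}
        risk = [1.0 if s in unsafe
                else min(q[(s, a)] for a in range(n_actions))
                for s in range(n_states)]

    shield = {}
    for s in range(n_states):
        allowed = [a for a in range(n_actions) if q[(s, a)] <= threshold]
        if not allowed:
            # Assumed fallback: keep the least risky actions to avoid deadlock.
            best = min(q[(s, a)] for a in range(n_actions))
            allowed = [a for a in range(n_actions) if q[(s, a)] == best]
        shield[s] = allowed
    return shield


# Toy data: state 2 is unsafe. In state 0, action 0 always stays in state 0,
# while action 1 reaches the unsafe state half of the time.
dataset = [(0, 0, 0)] * 2000 + [(0, 1, 0)] * 1000 + [(0, 1, 2)] * 1000
shield = build_shield(dataset, n_states=3, n_actions=2, unsafe={2})
print(shield[0])
```

A policy improvement step would then choose, in each state, only among the actions in `shield[state]`. Note that the per-pair confidence radius used here is a simplification: a rigorous high-probability guarantee over all state-action pairs would need the confidence level split across them.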
