Safe Reinforcement Learning in Constrained Markov Decision Processes
Akifumi Wachi, Yanan Sui

TL;DR
This paper introduces SNO-MDP, an algorithm for safe reinforcement learning that learns safety constraints and optimizes rewards within safe regions, with theoretical guarantees and practical validation in synthetic and real-world-inspired environments.
Contribution
The paper proposes SNO-MDP, a novel algorithm that explores and optimizes safety and reward in unknown constrained MDPs with theoretical safety and optimality guarantees.
Findings
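SNO-MDP first learns the safety constraints by expanding the safe region, then optimizes cumulative reward within the certified safe region.
Theoretical guarantees cover both satisfaction of the safety constraint and near-optimality of the cumulative reward, under proper regularity assumptions.
Experiments demonstrate the method's effectiveness on synthetic data in the GP-SAFETY-GYM environment and in a simulated Mars exploration scenario.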
Abstract
Safe reinforcement learning has been a promising approach for optimizing the policy of an agent that operates in safety-critical applications. In this paper, we propose an algorithm, SNO-MDP, that explores and optimizes Markov decision processes under unknown safety constraints. Specifically, we take a stepwise approach for optimizing safety and cumulative reward. In our method, the agent first learns safety constraints by expanding the safe region, and then optimizes the cumulative reward in the certified safe region. We provide theoretical guarantees on both the satisfaction of the safety constraint and the near-optimality of the cumulative reward under proper regularity assumptions. We demonstrate the effectiveness of SNO-MDP through two experiments: one uses synthetic data in a new, openly available environment named GP-SAFETY-GYM, and the other simulates Mars…
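The stepwise idea in the abstract (expand a safe region first, then optimize reward inside it) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a deterministic 1-D chain of states, noise-free safety observations, and a known Lipschitz constant that stands in for the paper's own certification procedure, which is not reproduced here. The function names, the threshold h, and all numeric values are hypothetical.

```python
def expand_safe_region(safety, seed, h, lipschitz):
    """Stage 1 (toy): grow a certified-safe set outward from a safe seed.

    Assumption: visiting a certified state s reveals safety[s] exactly.
    A neighbour is certified when the pessimistic bound
    safety[s] - lipschitz * distance is still at least the threshold h.
    """
    n = len(safety)
    safe = {seed}
    frontier = [seed]
    while frontier:
        s = frontier.pop()
        for nxt in (s - 1, s + 1):
            if 0 <= nxt < n and nxt not in safe:
                if safety[s] - lipschitz * abs(s - nxt) >= h:
                    safe.add(nxt)
                    frontier.append(nxt)
    return safe


def value_iteration_in_safe_region(reward, safe, gamma=0.9, iters=200):
    """Stage 2 (toy): value iteration that never leaves the certified set.

    Actions are move left (-1), stay (0), move right (+1); the reward is
    collected at the state the agent lands in.
    """
    values = [0.0] * len(reward)
    policy = {}
    for _ in range(iters):
        new_values = list(values)
        for s in safe:
            best = None
            for a in (-1, 0, 1):
                nxt = s + a
                if nxt not in safe:
                    continue  # moves outside the certified set are forbidden
                q = reward[nxt] + gamma * values[nxt]
                if best is None or q > best:
                    best, policy[s] = q, a
            new_values[s] = best
        values = new_values
    return values, policy


# Hypothetical chain: neighbouring safety values differ by at most 0.2,
# so the Lipschitz constant used below is valid for this data.
safety = [0.9, 0.8, 0.7, 0.5, 0.3, 0.1, 0.3, 0.5]
reward = [0.0, 0.0, 1.0, 2.0, 10.0, 0.0, 0.0, 0.0]
h = 0.4

safe = expand_safe_region(safety, seed=1, h=h, lipschitz=0.2)
values, policy = value_iteration_in_safe_region(reward, safe)
print(sorted(safe), policy)
```

In this toy chain the largest reward sits at state 4, whose safety value is below h, so the policy stops at the edge of the certified set rather than collecting it. State 7 is truly safe but is never certified, which shows how pessimistic expansion trades coverage for safety.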
Taxonomy
Topics: Reinforcement Learning in Robotics · Data Stream Mining Techniques
