Safe Policy Improvement in Constrained Markov Decision Processes
Luigi Berducci, Radu Grosu

TL;DR
This paper introduces a method for safe reinforcement learning in constrained Markov decision processes (CMDPs). It combines automatic reward shaping from formal requirements with a policy update algorithm that provides high-confidence safety guarantees during policy improvement.
Contribution
The paper presents an automatic reward-shaping procedure that derives a scalar reward signal from formal requirements, and a safe policy update algorithm with high-confidence guarantees, aimed at safety-critical applications.
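To make the idea of a high-confidence safe update concrete, here is a minimal sketch of one generic acceptance test. It is not the paper's algorithm. It assumes that per-episode safety-violation costs of a candidate policy are available as i.i.d. samples bounded in [0, cost_max] (for example from a learned model or a held-out evaluation), and it uses a one-sided Hoeffding bound. The names `hoeffding_upper_bound`, `accept_update`, `baseline_cost`, and `delta` are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def hoeffding_upper_bound(costs: Sequence[float], delta: float,
                          cost_max: float = 1.0) -> float:
    """Upper bound on the expected cost that holds with probability >= 1 - delta.

    Assumption (not from the paper): costs are i.i.d. and lie in [0, cost_max].
    """
    n = len(costs)
    if n == 0:
        raise ValueError("need at least one cost sample")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    mean = sum(costs) / n
    # One-sided Hoeffding inequality for variables bounded in [0, cost_max].
    return mean + cost_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def accept_update(candidate_costs: Sequence[float], baseline_cost: float,
                  delta: float = 0.05, cost_max: float = 1.0) -> bool:
    """Accept the candidate policy only if its cost bound does not exceed
    the baseline policy's violation cost (hypothetical acceptance rule)."""
    return hoeffding_upper_bound(candidate_costs, delta, cost_max) <= baseline_cost


if __name__ == "__main__":
    # With few samples the bound is loose, so the update is rejected.
    print(accept_update([0.0] * 10, baseline_cost=0.1))
    # With more samples the bound tightens and the update can be accepted.
    print(accept_update([0.0] * 200, baseline_cost=0.1))
```

The design choice here is conservatism: when evidence is scarce, the confidence term dominates and the current policy is kept, so an update is applied only once the data supports it at confidence level 1 - delta.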
Findings
Effective safe policy improvement demonstrated on control benchmarks.
Robustness maintained under hyperparameter perturbations.
Model-based RL enhances data efficiency and safety.
Abstract
The automatic synthesis of a policy through reinforcement learning (RL) from a given set of formal requirements depends on the construction of a reward signal and consists of the iterative application of many policy-improvement steps. The synthesis algorithm has to balance target, safety, and comfort requirements in a single objective and to guarantee that the policy improvement does not increase the number of safety-requirements violations, especially for safety-critical applications. In this work, we present a solution to the synthesis problem by solving its two main challenges: reward-shaping from a set of formal requirements and safe policy update. For the former, we propose an automatic reward-shaping procedure, defining a scalar reward signal compliant with the task specification. For the latter, we introduce an algorithm ensuring that the policy is improved in a safe fashion with…
