Provably Safe Reinforcement Learning for Stochastic Reach-Avoid Problems with Entropy Regularization
Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu

TL;DR
This paper develops online safe reinforcement learning algorithms for stochastic reach-avoid problems and shows that adding entropy regularization improves the regret bound and reduces episode-to-episode variability, while the safety constraints hold with high probability during learning.
Contribution
It proposes two online safe RL algorithms for stochastic reach-avoid tasks: one based on the optimism in the face of uncertainty (OFU) principle, and an entropy-regularized extension of it. Both come with a finite-sample analysis and a regret bound.
Findings
Entropy regularization improves regret bounds.
The algorithms ensure safety with high probability during learning.
Entropy regularization reduces episode-to-episode variability.
Abstract
We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that ensure safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Based on the first algorithm, we propose our main algorithm, which utilizes entropy regularization. We investigate the finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that the inclusion of entropy regularization improves the regret and drastically controls the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.
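To make the reach-avoid objective and the effect of a stochastic policy concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes the transition model is known (so there is no learning, no OFU exploration, and no regret), the 4-state MDP and all its numbers are hypothetical, and the Boltzmann (softmax) policy over unregularized Q-values is only a stand-in for the kind of stochastic policy that entropy regularization tends to produce. The temperature `tau` is likewise an illustrative parameter.

```python
import math

# Hypothetical 4-state MDP: 0 = start, 1 = midpoint, 2 = target, 3 = unsafe.
# P[s][a][s2] is the probability of moving from s to s2 under action a.
# States 2 and 3 are absorbing, so they have no entries here.
P = {
    0: {0: [0.0, 0.9, 0.0, 0.1], 1: [0.0, 0.0, 0.5, 0.5]},
    1: {0: [0.0, 0.0, 0.9, 0.1], 1: [0.5, 0.0, 0.3, 0.2]},
}
TARGET, UNSAFE = 2, 3


def q_values(V):
    """One-step lookahead: Q[s][a] = sum over s2 of P(s2 | s, a) * V[s2]."""
    return {
        s: {a: sum(p * v for p, v in zip(row, V)) for a, row in acts.items()}
        for s, acts in P.items()
    }


def reach_avoid_values(policy=None, iters=500):
    """Probability of hitting TARGET before UNSAFE from each state.

    With policy=None the maximum over actions is taken (optimal value);
    otherwise policy[s][a] gives the action probabilities to average over.
    """
    V = [0.0, 0.0, 0.0, 0.0]
    V[TARGET] = 1.0  # boundary values: target counts as success, unsafe as failure
    for _ in range(iters):
        Q = q_values(V)
        for s in P:
            if policy is None:
                V[s] = max(Q[s].values())
            else:
                V[s] = sum(policy[s][a] * Q[s][a] for a in Q[s])
    return V


def softmax_policy(Q, tau):
    """Boltzmann policy: pi(a | s) proportional to exp(Q[s][a] / tau)."""
    policy = {}
    for s, qa in Q.items():
        m = max(qa.values())  # subtract the max for numerical stability
        w = {a: math.exp((q - m) / tau) for a, q in qa.items()}
        z = sum(w.values())
        policy[s] = {a: w[a] / z for a in w}
    return policy


V_opt = reach_avoid_values()
Q_opt = q_values(V_opt)
for tau in (1.0, 0.1, 0.01):
    pi = softmax_policy(Q_opt, tau)
    V_pi = reach_avoid_values(pi)
    print(f"tau={tau}: P(reach-avoid from start) = {V_pi[0]:.3f} "
          f"(optimal {V_opt[0]:.3f})")
```

A larger `tau` spreads probability across actions and lowers the reach-avoid probability relative to the optimum, while a small `tau` recovers the greedy policy. In the online setting studied in the paper the model is unknown, so this trade-off has to be handled while learning.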
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
