Provably Safe Reinforcement Learning for Stochastic Reach-Avoid Problems with Entropy Regularization

Abhijit Mazumdar; Rafal Wisniewski; Manuela L. Bujorianu

arXiv:2601.08646 · cs.LG · January 21, 2026

Open Access

TL;DR

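This paper develops online safe reinforcement learning algorithms for stochastic reach-avoid problems that satisfy the safety constraints with arbitrarily high probability during learning, and shows that adding entropy regularization improves the regret and reduces episode-to-episode variability.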

Contribution

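It proposes two online algorithms for stochastic reach-avoid tasks: a first algorithm based on the optimism in the face of uncertainty (OFU) principle, and a main algorithm that builds on it with entropy regularization. Both come with a finite-sample analysis and regret bounds.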

Findings

01 Entropy regularization improves regret bounds.

02 The algorithms ensure safety with high probability during learning.

03 Entropy regularization reduces episode-to-episode variability.

Abstract

We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that ensure safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Based on the first algorithm, we propose our main algorithm, which utilizes entropy regularization. We investigate the finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that the inclusion of entropy regularization improves the regret and drastically controls the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.
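
To make the reach-avoid objective in the abstract concrete, here is a minimal sketch under stated assumptions. It is not the authors' algorithm: it assumes a small tabular MDP whose transition probabilities are already known, so it involves no learning, no OFU confidence sets, and no regret analysis. The function names, the toy MDP, and the `temperature` parameter are all hypothetical. The sketch computes the finite-horizon probability of reaching a target set before entering an unsafe set, either with a greedy (max) backup or with a softmax policy over action values, which is the kind of smoothed policy that entropy regularization typically induces.

```python
import math


def softmax(values, temperature):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    weights = [math.exp((v - peak) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def reach_avoid_values(P, target, unsafe, horizon, temperature=None):
    """Probability of reaching `target` before `unsafe` within `horizon` steps.

    P[s][a][t] is the (assumed known) probability of moving from state s
    to state t under action a. With temperature=None the backup is greedy
    (max over actions); with a positive temperature the value is that of a
    softmax policy over the current action values.
    """
    n_states = len(P)
    value = [1.0 if s in target else 0.0 for s in range(n_states)]
    for _ in range(horizon):
        new_value = []
        for s in range(n_states):
            if s in target:
                new_value.append(1.0)  # goal already reached
            elif s in unsafe:
                new_value.append(0.0)  # safety already violated
            else:
                q = [sum(p * v for p, v in zip(row, value)) for row in P[s]]
                if temperature is None:
                    new_value.append(max(q))
                else:
                    pi = softmax(q, temperature)
                    new_value.append(sum(w * x for w, x in zip(pi, q)))
        value = new_value
    return value


# Hypothetical toy MDP: state 0 is the start, state 1 the target, state 2 unsafe.
# Action 0 is fast but risky; action 1 is slower but never enters the unsafe state.
P = [
    [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]

greedy = reach_avoid_values(P, target={1}, unsafe={2}, horizon=2)
smoothed = reach_avoid_values(P, target={1}, unsafe={2}, horizon=2, temperature=0.1)
print(greedy[0], smoothed[0])
```

In this toy example the softmax policy spreads probability over both actions, so its reach-avoid probability from the start state is never higher than the greedy one; a smaller `temperature` moves it closer to the greedy value.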

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization