Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Kihyun Yu; Duksang Lee; William Overman; Dabeen Lee

arXiv:2410.10158 · cs.LG · October 15, 2024

Open Access

TL;DR

This paper introduces a new model-based safe reinforcement learning algorithm with tighter cost and reward estimators, achieving improved regret bounds while ensuring no constraint violations, and nearly matching the theoretical lower bounds in certain settings.

Contribution

It proposes novel cost and reward estimators based on a Bellman-type law of total variance, leading to tighter regret bounds and improved theoretical guarantees in safe RL.
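As background for the term "law of total variance", here is a minimal numerical sketch of the classical identity for a fixed policy in a finite-horizon MDP: the variance of the cumulative reward equals the expected sum of per-step variances of the next-step value. It is not the paper's estimator. The function name `total_variance_check`, the random MDP, the stationary kernel, and the deterministic rewards and policy are all assumptions made for illustration (the paper itself allows stochastic rewards and costs).

```python
import numpy as np

def total_variance_check(S=4, A=3, H=5, seed=0):
    """Compare Var(return) with the expected sum of per-step value variances.

    Assumptions (illustrative only): stationary kernel P, deterministic
    rewards in [0, 1], and a fixed deterministic policy pi.
    """
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] is a distribution
    r = rng.uniform(0.0, 1.0, size=(H, S, A))    # deterministic reward table
    pi = rng.integers(0, A, size=(H, S))         # action taken at (h, s)

    V = np.zeros((H + 1, S))   # value function, V[H] = 0
    M = np.zeros((H + 1, S))   # second moment of the return-to-go
    W = np.zeros((H + 1, S))   # expected sum of per-step variances
    idx = np.arange(S)
    for h in range(H - 1, -1, -1):
        a = pi[h]
        P_h = P[idx, a]        # (S, S) transition matrix under pi at step h
        r_h = r[h, idx, a]     # (S,) rewards under pi at step h
        EV = P_h @ V[h + 1]
        V[h] = r_h + EV
        # E[(r + G')^2] = r^2 + 2 r E[G'] + E[G'^2]
        M[h] = r_h ** 2 + 2.0 * r_h * EV + P_h @ M[h + 1]
        # Var_{s' ~ P(.|s, a)}(V_{h+1}(s')) plus the variance accumulated later
        var_next = P_h @ (V[h + 1] ** 2) - EV ** 2
        W[h] = var_next + P_h @ W[h + 1]
    return M[0] - V[0] ** 2, W[0]

var_direct, var_sum = total_variance_check()
print(np.allclose(var_direct, var_sum))
```

Because the return lies in $[0, H]$, its variance is at most $H^2/4$, so the sum of the $H$ per-step variances is bounded by order $H^2$ rather than the naive $H \cdot H^2$. Identities of this kind are a common route to saving a factor of $H$ in confidence bonuses; how the paper applies its own variant is not shown on this page.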

Findings

01

Achieves a regret upper bound of $\tilde{O}((\bar{C} - \bar{C}_b)^{-1} H^{2.5} S \sqrt{AK})$

02

Nearly matches the regret lower bound when $\bar{C} - \bar{C}_b = \Omega(H)$

03

Demonstrates computational effectiveness through numerical experiments
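The effect of the condition in the second finding can be worked out directly from the upper bound in the first. A short derivation, assuming $\bar{C} - \bar{C}_b \ge cH$ for some constant $c > 0$ (the constant $c$ is introduced here only for illustration):

```latex
\begin{align*}
\tilde{O}\!\left((\bar{C} - \bar{C}_b)^{-1} H^{2.5} S \sqrt{AK}\right)
  &\subseteq \tilde{O}\!\left((cH)^{-1} H^{2.5} S \sqrt{AK}\right) \\
  &= \tilde{O}\!\left(H^{1.5} S \sqrt{AK}\right).
\end{align*}
```

In this regime the safety gap cancels one factor of $H$, so the upper bound scales as $H^{1.5}$ in the horizon.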

Abstract

This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of $\tilde{O}((\bar{C} - \bar{C}_b)^{-1} H^{2.5} S \sqrt{AK})$, where $\bar{C}$ is the cost budget for an episode, $\bar{C}_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$, and $K$ are the number of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar{C} - \bar{C}_b = \Omega(H)$, it nearly matches the regret…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Traffic control and management