Provably Efficient Safe Exploration via Primal-Dual Policy Optimization
Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, Mihailo R. Jovanović

TL;DR
This paper introduces an efficient algorithm for safe reinforcement learning in complex environments, balancing reward maximization and safety constraints with provable guarantees in a function approximation setting.
Contribution
It proposes the first provably efficient online policy optimization algorithm for CMDPs with safety constraints under function approximation.
Findings
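Achieves $\tilde{O}(d H^{2.5}\sqrt{T})$ bounds on both regret and constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the episode horizon, and $T$ is the total number of steps.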
Handles infinite state spaces via feature mapping.
Provides theoretical guarantees for safe exploration in CMDPs.
Abstract
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing SRL algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm where the value…
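The primal-dual idea named in the abstract can be illustrated on a toy problem. The sketch below is not OPDOP: it has no function approximation, no optimistic bonus, and no exploration. It assumes a hypothetical one-step problem with three actions whose rewards and utilities are known exactly. It alternates a softmax (mirror-ascent) policy step on the Lagrangian with a projected subgradient step on the dual variable. The function names, step sizes, and problem data are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def primal_dual_toy(rewards, utilities, threshold,
                    steps=5000, alpha=0.1, eta=0.1):
    """Toy primal-dual loop for: max E[r] subject to E[g] >= threshold.

    Assumptions (not from the paper): a single state, horizon 1,
    known rewards/utilities, and fixed step sizes alpha and eta.
    Returns the last policy, the last dual variable, and the
    reward/utility averaged over all iterates.
    """
    logits = [0.0] * len(rewards)
    lam = 0.0  # dual variable (Lagrange multiplier), kept non-negative
    avg_reward = 0.0
    avg_utility = 0.0
    for _ in range(steps):
        # Primal step: exponentiated-gradient ascent on r + lam * g.
        logits = [z + alpha * (r + lam * g)
                  for z, r, g in zip(logits, rewards, utilities)]
        policy = softmax(logits)
        exp_reward = sum(p * r for p, r in zip(policy, rewards))
        exp_utility = sum(p * g for p, g in zip(policy, utilities))
        # Dual step: raise lam when the constraint is violated,
        # lower it otherwise, then project onto [0, inf).
        lam = max(0.0, lam - eta * (exp_utility - threshold))
        avg_reward += exp_reward / steps
        avg_utility += exp_utility / steps
    return policy, lam, avg_reward, avg_utility


if __name__ == "__main__":
    # Hypothetical data: higher reward comes with lower utility.
    result = primal_dual_toy([1.0, 0.5, 0.0], [0.0, 0.5, 1.0], 0.5)
    print(result)
```

The averages are reported rather than the last iterate because primal-dual iterates of this kind tend to oscillate around the saddle point, while the running average settles near the constraint boundary.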
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning
