Constrained Policy Optimization with Cantelli-Bounded Value-at-Risk
Rohan Tangri, Jan-Peter Calliess

TL;DR
This paper presents VaR-CPO, a sample-efficient reinforcement learning algorithm that optimizes policies under Value-at-Risk constraints, using Cantelli's inequality to make the constraint tractable, and that empirically achieves zero constraint violations during training in feasible environments.
Contribution
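The paper introduces a novel VaR-constrained RL method that uses Cantelli's inequality to replace the non-differentiable VaR constraint with a surrogate built from the first two moments of the cost return. It also extends the trust-region framework of CPO to provide worst-case bounds on policy improvement and constraint violation.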
Findings
VaR-CPO achieves zero constraint violations in feasible environments.
The method provides worst-case bounds on policy improvement and constraint violations.
Empirical results demonstrate safe exploration capabilities.
Abstract
We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample-efficient and conservative method designed to optimize Value-at-Risk (VaR) constrained reinforcement learning (RL) problems. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ Cantelli's inequality to obtain a tractable approximation based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide worst-case bounds for both policy improvement and constraint violation during the training process.
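The Cantelli step can be illustrated with a minimal sketch. It assumes the VaR constraint takes the form P(C > d) <= eps on the cost return C; the symbols d and eps and every function name below are illustrative assumptions, not the paper's notation or code. Cantelli's inequality states P(C >= mu + lam) <= sigma^2 / (sigma^2 + lam^2) for lam > 0. Setting the right-hand side equal to eps gives lam = sigma * sqrt((1 - eps) / eps), so the moment-based condition mu + sigma * sqrt((1 - eps) / eps) <= d is sufficient for the VaR constraint. The code checks this surrogate from sampled cost returns and compares it with the empirical quantile.

```python
import math
import random
import statistics


def cantelli_var_bound(mean: float, std: float, eps: float) -> float:
    """Distribution-free upper bound on VaR at level 1 - eps.

    Cantelli: P(X >= mean + lam) <= std**2 / (std**2 + lam**2) for lam > 0.
    Setting the right-hand side equal to eps gives
    lam = std * sqrt((1 - eps) / eps), hence VaR_{1-eps}(X) <= mean + lam.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return mean + std * math.sqrt((1.0 - eps) / eps)


def var_constraint_satisfied(cost_returns, threshold: float, eps: float) -> bool:
    """Hypothetical surrogate check built from the first two moments.

    If the Cantelli bound is <= threshold, then P(C > threshold) <= eps
    holds for the empirical distribution of the sampled cost returns.
    """
    mean = statistics.fmean(cost_returns)
    std = statistics.pstdev(cost_returns)  # population std of the samples
    return cantelli_var_bound(mean, std, eps) <= threshold


def empirical_var(samples, eps: float) -> float:
    """Empirical VaR at level 1 - eps: smallest sample t with P(X <= t) >= 1 - eps."""
    ordered = sorted(samples)
    k = max(1, math.ceil((1.0 - eps) * len(ordered)))
    return ordered[k - 1]


if __name__ == "__main__":
    random.seed(0)
    # Synthetic, illustrative cost returns (not data from the paper).
    costs = [random.expovariate(1.0) for _ in range(10_000)]
    eps = 0.05
    bound = cantelli_var_bound(statistics.fmean(costs), statistics.pstdev(costs), eps)
    print(f"empirical VaR_0.95 = {empirical_var(costs, eps):.3f}")
    print(f"Cantelli bound     = {bound:.3f}")
    print("surrogate satisfied for d = 6.0:", var_constraint_satisfied(costs, 6.0, eps))
```

The gap between the two printed values shows why such a surrogate is conservative. The bound must hold for every distribution with the same mean and variance, so it can reject a policy whose true VaR is already within the limit. In exchange, it never accepts one whose VaR exceeds it.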
