Constrained Policy Optimization with Cantelli-Bounded Value-at-Risk

Rohan Tangri; Jan-Peter Calliess

arXiv:2601.22993·cs.LG·May 1, 2026

TL;DR

This paper presents VaR-CPO, a sample-efficient reinforcement learning algorithm for Value-at-Risk constrained problems. It uses Cantelli's inequality to make the constraint tractable and, in experiments on feasible environments, incurs zero constraint violations during training.

Contribution

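The paper introduces a VaR-constrained RL method that uses Cantelli's inequality to approximate the constraint from the first two moments of the cost return. It also extends the trust-region framework of CPO to provide worst-case bounds on policy improvement and constraint violation.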

Findings

01 Empirically, VaR-CPO achieves zero constraint violations during training in feasible environments.

02 The method provides worst-case bounds on policy improvement and constraint violation during training.

03 Experiments show that VaR-CPO is capable of safe exploration, a property the baseline methods fail to uphold.

Abstract

We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample-efficient and conservative method designed to solve Value-at-Risk (VaR) constrained reinforcement learning (RL) problems. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ Cantelli's inequality to obtain a tractable approximation based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide worst-case bounds for both policy improvement and constraint violation during the training process.
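To make the Cantelli step concrete, here is a minimal sketch of how a two-moment bound can stand in for a VaR constraint. It is not the paper's formulation, which this page does not reproduce. It relies only on the standard one-sided Chebyshev (Cantelli) inequality, P(C - mu >= lambda) <= sigma^2 / (sigma^2 + lambda^2) for lambda > 0, which implies that the VaR of the cost return C at confidence level 1 - eps is at most mu + sigma * sqrt((1 - eps) / eps). The function names, the exponential cost distribution, the 0.95 confidence level, and the threshold of 6.0 are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def cantelli_var_upper_bound(mean, std, confidence):
    """Distribution-free upper bound on VaR at the given confidence level.

    Follows from Cantelli's inequality using only the mean and standard
    deviation of the cost return (hypothetical helper, not from the paper).
    """
    eps = 1.0 - confidence
    return mean + std * math.sqrt((1.0 - eps) / eps)


def satisfies_surrogate(mean, std, confidence, threshold):
    """Conservative check: if this holds, VaR at `confidence` is <= threshold."""
    return cantelli_var_upper_bound(mean, std, confidence) <= threshold


def empirical_var(samples, confidence):
    """Smallest sample value whose empirical CDF reaches `confidence`."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, max(0, math.ceil(confidence * len(ordered)) - 1))
    return ordered[index]


# Illustrative cost returns; the exponential distribution is an assumption.
random.seed(0)
cost_returns = [random.expovariate(1.0) for _ in range(10_000)]

mean = statistics.fmean(cost_returns)
std = statistics.pstdev(cost_returns)

bound = cantelli_var_upper_bound(mean, std, confidence=0.95)
var95 = empirical_var(cost_returns, confidence=0.95)

print(f"empirical VaR(0.95): {var95:.3f}")
print(f"Cantelli upper bound: {bound:.3f}")
print("surrogate satisfied for threshold 6.0:",
      satisfies_surrogate(mean, std, 0.95, threshold=6.0))
```

The bound depends on the cost return only through its mean and standard deviation, so it is smooth in those two quantities, whereas the quantile itself is not. The price is conservatism: the bound holds for every distribution with those moments, so it typically sits well above the true VaR, which is consistent with the abstract's description of the method as conservative.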
