Extreme Value Policy Optimization for Safe Reinforcement Learning

Shiqing Gao; Yihang Zhou; Shuai Shao; Haoyu Luo; Yiheng Bing; Jiaxin Ding; Luoyi Fu; Xinbing Wang

arXiv:2601.12008·cs.LG·January 21, 2026

TL;DR

This paper introduces EVO, a reinforcement learning algorithm that uses Extreme Value Theory to better model and mitigate rare, high-impact constraint violations, improving safety in real-world applications.

Contribution

EVO is the first RL method to explicitly incorporate extreme value modeling for safety constraints, reducing violations and providing theoretical guarantees.

Findings

01. EVO reduces constraint violations more effectively than baseline methods.

02. EVO maintains competitive policy performance.

03. EVO exhibits lower variance than quantile regression approaches.

Abstract

Ensuring safety is a critical challenge in applying Reinforcement Learning (RL) to real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints, typically formulated as the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, reducing constraint violations. EVO introduces an extreme quantile optimization objective to explicitly capture extreme samples in the cost tail distribution. Additionally, we propose an extreme prioritization mechanism during replay, amplifying the…
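To make the idea of an extreme quantile concrete, the sketch below estimates a high quantile of a cost sample with the standard peaks-over-threshold approach from Extreme Value Theory: excesses over a high threshold are fitted with a Generalized Pareto Distribution, and the fitted tail is extrapolated beyond the observed data. This illustrates the general EVT technique only and is not the paper's implementation; the function name `pot_extreme_quantile`, the method-of-moments fit, and the default threshold level of 0.9 are assumptions chosen for brevity.

```python
import numpy as np


def pot_extreme_quantile(costs, level=0.999, threshold_level=0.9):
    """Estimate the `level` quantile of `costs` by peaks-over-threshold.

    Hypothetical helper for illustration. It fits a Generalized Pareto
    Distribution (GPD) to the excesses over an empirical threshold using
    the method of moments, which assumes a tail shape parameter below 0.5.
    """
    costs = np.asarray(costs, dtype=float)
    if not threshold_level < level < 1.0:
        raise ValueError("level must lie between threshold_level and 1")

    n = costs.size
    u = np.quantile(costs, threshold_level)   # high threshold
    excesses = costs[costs > u] - u           # amounts by which costs exceed u
    n_u = excesses.size
    if n_u < 2:
        raise ValueError("not enough exceedances to fit a tail model")

    mean = excesses.mean()
    var = excesses.var(ddof=1)
    if var <= 0.0:
        raise ValueError("excesses have zero variance")

    # Method-of-moments GPD estimates (shape xi, scale sigma).
    ratio = mean**2 / var
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * mean * (1.0 + ratio)

    # Probability of exceeding the target quantile, conditional on exceeding u.
    tail_prob = (n / n_u) * (1.0 - level)
    if abs(xi) < 1e-8:
        # Exponential-tail limit of the GPD quantile formula.
        return u - sigma * np.log(tail_prob)
    return u + (sigma / xi) * (tail_prob ** (-xi) - 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    episode_costs = rng.exponential(scale=1.0, size=20_000)  # synthetic costs
    print("mean cost:", episode_costs.mean())
    print("estimated 0.999 quantile:", pot_extreme_quantile(episode_costs))
```

On a heavy-tailed cost sample, the mean can look safe while the estimated extreme quantile is far larger, which is the gap between expectation-based constraints and tail-aware ones that the abstract describes.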

Videos

Extreme Value Policy Optimization for Safe Reinforcement Learning · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Adaptive Dynamic Programming Control