Extreme Value Policy Optimization for Safe Reinforcement Learning
Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, Xinbing Wang

TL;DR
This paper introduces EVO, a reinforcement learning algorithm that uses Extreme Value Theory to better model and mitigate rare, high-impact constraint violations, improving safety in real-world applications.
Contribution
EVO is the first RL method to explicitly incorporate extreme value modeling for safety constraints, reducing violations and providing theoretical guarantees.
Findings
EVO reduces constraint violations more effectively than baseline methods.
EVO maintains competitive policy performance.
EVO exhibits lower variance than quantile regression approaches.
Abstract
Ensuring safety is a critical challenge in applying Reinforcement Learning (RL) to real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints, typically formulated as the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, reducing constraint violations. EVO introduces an extreme quantile optimization objective to explicitly capture extreme samples in the cost tail distribution. Additionally, we propose an extreme prioritization mechanism during replay, amplifying the…
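To make the gap between an expectation-based constraint and a tail-aware one concrete, here is a minimal sketch of extreme quantile estimation in the EVT style the abstract refers to. It is not the authors' implementation. It assumes a standard peaks-over-threshold approach: fit a Generalized Pareto Distribution (GPD) to the cost exceedances above a high threshold, using a simple method-of-moments estimator, then extrapolate a quantile beyond that threshold. The cost samples are synthetic, and the function names, the cost limit of 10, and the 0.9 threshold level are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def fit_gpd_moments(exceedances):
    """Method-of-moments fit of a GPD to threshold exceedances.

    Returns (xi, sigma): shape and scale. Assumes the true shape is below 0.5,
    so that the variance of the exceedances is finite.
    """
    m = statistics.fmean(exceedances)
    v = statistics.variance(exceedances)
    ratio = m * m / v                 # equals 1 - 2*xi for an exact GPD
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (ratio + 1.0)   # equals mean * (1 - xi)
    return xi, sigma


def extreme_quantile(costs, q, threshold_level=0.9):
    """Peaks-over-threshold estimate of the q-quantile of the cost distribution."""
    xs = sorted(costs)
    n = len(xs)
    u = xs[int(threshold_level * n)]            # empirical high threshold
    exceed = [x - u for x in xs if x > u]       # tail samples above the threshold
    xi, sigma = fit_gpd_moments(exceed)
    tail_frac = len(exceed) / n                 # estimate of P(cost > u)
    r = (1.0 - q) / tail_frac                   # target survival prob. within the tail
    if abs(xi) < 1e-8:                          # exponential-tail limit of the GPD
        return u - sigma * math.log(r)
    return u + sigma / xi * (r ** (-xi) - 1.0)


# Synthetic episode costs (illustrative only): exponential with mean 5.
random.seed(0)
costs = [random.expovariate(1 / 5) for _ in range(5000)]

limit = 10.0                                    # hypothetical cost limit
mean_cost = statistics.fmean(costs)
q999 = extreme_quantile(costs, 0.999)

print(f"expected cost:       {mean_cost:.2f} (limit {limit})")
print(f"0.999 cost quantile: {q999:.2f}")
```

In this synthetic setting the expected cost sits below the limit while the estimated 0.999 quantile sits well above it, which is the kind of tail event an expectation-based constraint does not see.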
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Adaptive Dynamic Programming Control
