Risk-averse Total-reward MDPs with ERM and EVaR
Xihong Su, Julien Grand-Clément, Marek Petrik

TL;DR
This paper demonstrates that risk-averse total reward Markov Decision Processes with ERM and EVaR can be optimized using stationary policies, simplifying analysis and deployment in broad risk-averse reinforcement learning scenarios.
Contribution
It shows that risk-averse total reward MDPs under ERM and EVaR can be optimized by stationary policies, and it proposes algorithms to compute them. The results apply to transient MDPs with both positive and negative rewards.
Findings
Exponential value iteration, policy iteration, and linear programming are proposed to compute optimal stationary policies.
The results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse RL domains.
The results require only that the MDP be transient, a relatively mild condition, and allow both positive and negative rewards.
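To make the exponential value iteration named above more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the ERM convention ERM_beta[X] = -(1/beta) log E[exp(-beta X)] for a random total reward X, and it works with the exponential value w(s) = -E[exp(-beta * total reward from s)], which equals -1 at a zero-reward absorbing sink. The MDP encoding (`P[s][a]` as a list of `(prob, next_state, reward)` triples, with `None` standing for the sink), the function name, and the stopping rule are hypothetical choices made for illustration. The iteration is only expected to converge when the MDP is transient and the ERM is finite at the chosen `beta`.

```python
import math

def exponential_value_iteration(P, beta, tol=1e-10, max_iter=10_000):
    """Sketch of value iteration on the exponential value w(s).

    P[s][a] is a list of (prob, next_state, reward) triples; next_state None
    denotes a zero-reward absorbing sink (hypothetical encoding).
    Returns (ERM values per state, greedy stationary policy).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    n = len(P)
    w = [-1.0] * n       # start from the zero-horizon value; w(sink) is fixed at -1
    policy = [0] * n
    for _ in range(max_iter):
        new_w = []
        for s in range(n):
            best_q, best_a = -math.inf, 0
            for a, transitions in enumerate(P[s]):
                # Each transition scales the successor's exponential value
                # by exp(-beta * reward).
                q = sum(
                    p * math.exp(-beta * r) * (-1.0 if s2 is None else w[s2])
                    for p, s2, r in transitions
                )
                if q > best_q:   # maximizing w is equivalent to maximizing ERM
                    best_q, best_a = q, a
            new_w.append(best_q)
            policy[s] = best_a
        diff = max(abs(x - y) for x, y in zip(new_w, w))
        w = new_w
        if diff < tol:
            break
    # Map the exponential value back to ERM: v(s) = -(1/beta) * log(-w(s)).
    v = [-math.log(-x) / beta for x in w]
    return v, policy

# Hypothetical one-state example: action 0 pays 1 for certain and stops;
# action 1 pays 0 or 2.2 with equal probability and stops.
toy = [[[(1.0, None, 1.0)], [(0.5, None, 0.0), (0.5, None, 2.2)]]]
values_averse, policy_averse = exponential_value_iteration(toy, beta=2.0)
values_neutral, policy_neutral = exponential_value_iteration(toy, beta=0.01)
```

In this toy instance, the strongly risk-averse setting (`beta=2.0`) selects the safe action 0, while the nearly risk-neutral setting (`beta=0.01`) selects action 1, whose mean reward is higher.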
Abstract
Optimizing risk-averse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse total reward criterion, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary policy, making it simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming to compute optimal policies. Compared with prior work, our results only require the relatively mild condition of transient MDPs and allow for both positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
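For readers unfamiliar with the two risk measures named in the abstract, the following Python sketch evaluates them for a discrete reward distribution. It assumes one common convention for rewards, which may differ in sign or parametrization from the paper's: ERM_beta[X] = -(1/beta) log E[exp(-beta X)] for beta > 0, and EVaR_alpha[X] = sup over beta > 0 of ERM_beta[X] + log(alpha)/beta for alpha in (0, 1]. The function names and the grid search over `betas` are hypothetical choices for illustration; the grid gives only a lower approximation of the supremum.

```python
import math

def erm(values, probs, beta):
    """Entropic risk of a discrete reward X: -(1/beta) * log E[exp(-beta * X)]."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Log-sum-exp shift for numerical stability.
    exponents = [-beta * x for x in values]
    m = max(exponents)
    s = sum(p * math.exp(e - m) for p, e in zip(probs, exponents))
    return -(m + math.log(s)) / beta

def evar(values, probs, alpha, betas):
    """Grid approximation of EVaR: max over betas of ERM_beta[X] + log(alpha)/beta."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    return max(erm(values, probs, b) + math.log(alpha) / b for b in betas)

# Hypothetical example: reward 0 or 10 with equal probability (mean 5).
values, probs = [0.0, 10.0], [0.5, 0.5]
grid = [0.01 * k for k in range(1, 501)]   # beta from 0.01 to 5.0
risk_erm = erm(values, probs, beta=1.0)
risk_evar = evar(values, probs, alpha=0.9, betas=grid)
```

Under this convention both quantities lie between the worst-case reward and the mean, and ERM decreases as `beta` grows, reflecting stronger risk aversion.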
Taxonomy
Topics: Auction Theory and Applications
