TL;DR
This paper studies the risk-averse optimization of accumulated rewards in Markov decision processes (MDPs). It analyzes deviation measures such as the semi-variance and the mean absolute deviation (MAD), with the aim of improving on variance-based methods and developing algorithms for the resulting problems.
Contribution
The paper introduces and analyzes alternative deviation measures for risk-averse optimization in MDPs, addresses limitations of the variance-based approach, and characterizes properties of optimal schedulers.
Findings
Semi-variance and MAD improve risk-averse behavior.
Optimal schedulers for these measures do not discourage accumulating further rewards on executions that are already good.
Algorithms are developed for MDPs and Markov chains.
Abstract
This paper addresses objectives tailored to the risk-averse optimization of accumulated rewards in Markov decision processes (MDPs). The studied objectives require maximizing the expected value of the accumulated rewards minus a penalty factor times a deviation measure of the resulting distribution of rewards. Using the variance in this penalty mechanism leads to the variance-penalized expectation (VPE) for which it is known that optimal schedulers have to minimize future expected rewards when a high amount of rewards has been accumulated. This behavior is undesirable as risk-averse behavior should keep the probability of particularly low outcomes low, but not discourage the accumulation of additional rewards on already good executions. The paper investigates the semi-variance, which only takes outcomes below the expected value into account, the mean absolute deviation (MAD), and the…
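The penalized objective described in the abstract, expected value minus a penalty factor times a deviation measure, can be illustrated on a finite reward distribution. The following is a minimal sketch, not the paper's algorithm: the function names, the two toy distributions, and the penalty factor `LAMBDA = 1.0` are assumptions made for illustration, and the semi-variance and MAD use their standard textbook definitions.

```python
from typing import Callable, Dict

# A finite distribution over accumulated rewards: outcome -> probability.
Dist = Dict[float, float]


def expectation(dist: Dist) -> float:
    return sum(x * p for x, p in dist.items())


def variance(dist: Dist) -> float:
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())


def semi_variance(dist: Dist) -> float:
    # Only outcomes below the expected value contribute.
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items() if x < mu)


def mad(dist: Dist) -> float:
    # Mean absolute deviation from the expected value.
    mu = expectation(dist)
    return sum(p * abs(x - mu) for x, p in dist.items())


def penalized(dist: Dist, deviation: Callable[[Dist], float], lam: float) -> float:
    # Expected value minus penalty factor times deviation measure.
    return expectation(dist) - lam * deviation(dist)


# Hypothetical toy distributions: "risky" never pays less than "safe".
safe: Dist = {1.0: 1.0}
risky: Dist = {1.0: 0.5, 4.0: 0.5}
LAMBDA = 1.0  # assumed penalty factor

for name, dev in [("variance", variance), ("semi-variance", semi_variance), ("MAD", mad)]:
    print(name, penalized(safe, dev, LAMBDA), penalized(risky, dev, LAMBDA))
```

In this toy instance the variance penalty ranks the sure outcome above a lottery that never pays less, which mirrors the undesirable behavior of the VPE described above. The example says nothing about how schedulers are computed on MDPs.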
