Risk-Averse Trust Region Optimization for Reward-Volatility Reduction

Lorenzo Bisi; Luca Sabbioni; Edoardo Vittori; Matteo Papini; Marcello Restelli

arXiv:1912.03193·cs.LG·December 9, 2019

TL;DR

This paper introduces a novel risk measure called reward volatility for reinforcement learning, and develops a risk-averse policy optimization method that reduces reward volatility to control return variance, with applications in financial simulations.

Contribution

The paper proposes reward volatility as a new risk measure, derives a policy gradient theorem for it, and adapts Trust Region Policy Optimization (TRPO) to risk-averse optimization in RL.

Findings

01. Reward volatility bounds return variance, enabling risk control.

02. The proposed method effectively reduces reward volatility in financial simulations.

03. The approach maintains monotonic policy improvement guarantees.

Abstract

In real-world decision-making problems, for instance in the fields of finance, robotics or autonomous driving, keeping uncertainty under control is as important as maximizing expected returns. Risk aversion has been addressed in the reinforcement learning literature through risk measures related to the variance of returns. However, in many cases, the risk is measured not only on a long-term perspective, but also on the step-wise rewards (e.g., in trading, to ensure the stability of the investment bank, it is essential to monitor the risk of portfolio positions on a daily basis). In this paper, we define a novel measure of risk, which we call reward volatility, consisting of the variance of the rewards under the state-occupancy measure. We show that the reward volatility bounds the return variance so that reducing the former also constrains the latter. We derive a policy gradient theorem…
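
To make the definition in the abstract concrete, the following is a minimal sketch of how reward volatility could be estimated from sampled trajectories. It is not the authors' implementation. It assumes that the state-occupancy measure is the discounted one, so each step-wise reward at time t is weighted by gamma^t, and that the weights are normalized per trajectory to handle finite horizons. The function name `reward_volatility` and the parameter `gamma` are hypothetical.

```python
import numpy as np


def reward_volatility(trajectories, gamma=0.99):
    """Estimate the occupancy-weighted mean reward J and reward volatility nu2.

    trajectories: list of non-empty per-step reward sequences, one per episode.
    Assumption: step t gets weight gamma**t, normalized so that the weights
    of each trajectory sum to 1 (a finite-horizon approximation).
    """
    weights, rewards = [], []
    for traj in trajectories:
        r = np.asarray(traj, dtype=float)
        w = gamma ** np.arange(len(r))
        weights.append(w / w.sum())  # per-trajectory normalization
        rewards.append(r)

    # Average over trajectories: all weights together sum to 1.
    w = np.concatenate(weights) / len(trajectories)
    r = np.concatenate(rewards)

    J = float(np.sum(w * r))                # weighted mean of step-wise rewards
    nu2 = float(np.sum(w * (r - J) ** 2))   # weighted variance of step-wise rewards
    return J, nu2


if __name__ == "__main__":
    # Two toy reward sequences with the same undiscounted sum but different spread.
    steady = [[1.0, 1.0, 1.0, 1.0]]
    bursty = [[4.0, 0.0, 0.0, 0.0]]
    print(reward_volatility(steady, gamma=0.9))
    print(reward_volatility(bursty, gamma=0.9))
```

A constant reward stream has zero volatility under this estimator, while a stream that concentrates the same total reward in one step does not. This is the step-wise notion of risk that the abstract contrasts with the variance of the long-term return.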

Taxonomy

Methods: Test · Trust Region Policy Optimization