Policy Newton methods for Distortion Riskmetrics
Soumen Pachal, Mizhaan Prajit Maniyar, Prashanth L.A

TL;DR
This paper introduces a novel policy Newton method for risk-sensitive reinforcement learning that maximizes distortion riskmetrics, providing convergence guarantees to second-order stationary points and demonstrating effectiveness through experiments.
Contribution
It develops a new Hessian estimator for DRM objectives and proposes a cubic-regularized policy Newton algorithm with convergence guarantees to second-order stationary points.
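To make the cubic-regularized Newton step concrete, here is a minimal sketch of one update direction. It assumes a maximization objective and the local model m(h) = g·h + 0.5 h·Hh − (α/6)‖h‖³, where `g` and `H` stand in for the estimated DRM gradient and Hessian. The function names, the sign convention, and the plain gradient-ascent subproblem solver are illustrative assumptions, not the authors' implementation.

```python
import math


def cubic_model_grad(g, H, h, alpha):
    """Gradient in h of m(h) = g.h + 0.5 * h.H h - (alpha / 6) * ||h||^3."""
    n = len(h)
    norm_h = math.sqrt(sum(x * x for x in h))
    Hh = [sum(H[i][j] * h[j] for j in range(n)) for i in range(n)]
    # d/dh of (alpha / 6) * ||h||^3 is (alpha / 2) * ||h|| * h
    return [g[i] + Hh[i] - 0.5 * alpha * norm_h * h[i] for i in range(n)]


def cubic_newton_step(g, H, alpha, lr=0.01, iters=5000):
    """Approximately maximize the cubic model by gradient ascent from h = 0.

    Assumption: a simple first-order inner solver; if g is exactly zero the
    iterate stays at h = 0, so a practical solver would need a perturbation.
    """
    h = [0.0] * len(g)
    for _ in range(iters):
        grad = cubic_model_grad(g, H, h, alpha)
        h = [h[i] + lr * grad[i] for i in range(len(h))]
    return h


def policy_update(theta, g, H, alpha):
    """One hypothetical parameter update: theta <- theta + argmax_h m(h)."""
    h = cubic_newton_step(g, H, alpha)
    return [theta[i] + h[i] for i in range(len(theta))]
```

In one dimension with `g = [1.0]`, `H = [[0.0]]`, and `alpha = 2.0`, the model is h − |h|³/3, whose maximizer is h = 1. The cubic term is what keeps the step bounded even when the Hessian estimate is not negative definite.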
Findings
Algorithm converges to an $\epsilon$-second-order stationary point.
Sample complexity is $O(\epsilon^{-3.5})$ for convergence.
Experiments validate theoretical convergence results.
Abstract
We consider the problem of risk-sensitive control in a reinforcement learning (RL) framework. In particular, we aim to find a risk-optimal policy by maximizing the distortion riskmetric (DRM) of the discounted reward in a finite-horizon Markov decision process (MDP). DRMs are a rich class of risk measures that include several well-known risk measures as special cases. We derive a policy Hessian theorem for the DRM objective using the likelihood ratio method. Using this result, we propose a natural DRM Hessian estimator from sample trajectories of the underlying MDP. Next, we present a cubic-regularized policy Newton algorithm for solving this problem in an on-policy RL setting using estimates of the DRM gradient and Hessian. Our proposed algorithm is shown to converge to an $\epsilon$-second-order stationary point ($\epsilon$-SOSP) of the DRM objective, and this guarantee ensures the…
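As a rough illustration of the quantity being maximized, the sketch below estimates a DRM from sampled returns. It assumes the common Choquet-integral definition with a distortion function g satisfying g(0) = 0, under which the empirical DRM reduces to a weighted sum of order statistics. The function name `empirical_drm` and the example distortion are hypothetical, and the abstract does not state which estimator the paper uses.

```python
def empirical_drm(samples, g):
    """Empirical DRM of `samples` under distortion function `g` (g(0) = 0).

    With order statistics x_(1) <= ... <= x_(n), the assumed estimator is
        sum_i x_(i) * [g((n - i + 1) / n) - g((n - i) / n)].
    """
    xs = sorted(samples)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        # Distorted probability mass assigned to the i-th smallest sample.
        weight = g((n - i + 1) / n) - g((n - i) / n)
        total += x * weight
    return total


def identity(u):
    """Identity distortion: the DRM reduces to the sample mean."""
    return u


def upper_half(u):
    """Example distortion that puts all weight on the top 50% of samples."""
    return min(u / 0.5, 1.0)
```

For the returns `[1.0, 2.0, 3.0, 4.0]`, the identity distortion gives the sample mean 2.5, while `upper_half` averages only the two largest returns and gives 3.5. Different choices of g recover different risk measures, which is the sense in which DRMs include several well-known measures as special cases.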
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
