Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy
Han Zhong, Xun Deng, Ethan X. Fang, Zhuoran Yang, Zhaoran Wang, Runze Li

TL;DR
This paper introduces a risk-sensitive deep reinforcement learning method that optimizes policies under variance constraints, providing theoretical guarantees of global optimality and demonstrating effectiveness on real datasets.
Contribution
It develops a novel actor-critic algorithm for variance-constrained policy optimization with provable convergence to a globally optimal policy.
Findings
The algorithm converges to a globally optimal policy at a sublinear rate.
The method effectively manages risk by constraining variance in long-term rewards.
Numerical studies validate theoretical results on real datasets.
Abstract
While deep reinforcement learning has achieved tremendous successes in various applications, most existing works only focus on maximizing the expected value of total return and thus ignore its inherent stochasticity. Such stochasticity is also known as the aleatoric uncertainty and is closely related to the notion of risk. In this work, we make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria. In particular, we focus on a variance-constrained policy optimization problem where the goal is to find a policy that maximizes the expected value of the long-run average reward, subject to a constraint that the long-run variance of the average reward is upper bounded by a threshold. Utilizing Lagrangian and Fenchel dualities, we transform the original problem into an unconstrained saddle-point policy…
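The formulation described in the abstract can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the paper: \rho(\pi) is the long-run average reward, \eta(\pi) the long-run average of the squared reward, \Lambda(\pi) = \eta(\pi) - \rho(\pi)^2 the long-run variance, \alpha the variance threshold, \lambda the Lagrange multiplier, and y the Fenchel dual variable. The paper's exact objective may differ in details.

```latex
% Variance-constrained problem (assumed notation)
\max_{\pi} \; \rho(\pi)
\quad \text{subject to} \quad
\Lambda(\pi) = \eta(\pi) - \rho(\pi)^2 \le \alpha .

% Lagrangian duality: move the constraint into the objective
\max_{\pi} \, \min_{\lambda \ge 0} \;
\rho(\pi) - \lambda \bigl( \eta(\pi) - \rho(\pi)^2 - \alpha \bigr) .

% Fenchel duality for the square: x^2 = \max_{y} \, (2xy - y^2),
% attained at y = x. Applied to \rho(\pi)^2 with \lambda \ge 0:
\max_{\pi} \, \min_{\lambda \ge 0} \, \max_{y \in \mathbb{R}} \;
\rho(\pi) - \lambda \, \eta(\pi)
+ 2 \lambda y \, \rho(\pi) - \lambda y^2 + \lambda \alpha .
```

The point of the Fenchel step is that the term \rho(\pi)^2, which is quadratic in an expectation and therefore awkward to estimate from samples, is replaced by terms that are linear in \rho(\pi) and \eta(\pi) for fixed \lambda and y.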
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Energy, Environment, and Transportation Policies
