Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang

arXiv:2007.02151 · cs.LG · July 7, 2020 · 37 cites

TL;DR

This paper introduces a variational policy gradient method for reinforcement learning with general utility functions, enabling policy optimization beyond traditional reward sums, with proven convergence guarantees and improved rates.

Contribution

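It derives a new Variational Policy Gradient Theorem for general utilities, which characterizes the policy gradient as the solution of a stochastic saddle point problem, and builds on it a practical algorithm with a global convergence analysis. For standard cumulative-reward RL, a special case of the framework, the analysis improves on previously available convergence rates.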

Findings

01. The proposed algorithm converges globally to the optimal policy.

02. It achieves an $O(1/t)$ convergence rate by exploiting the hidden convexity of the problem, and converges exponentially under hidden strong convexity.

03. It generalizes policy gradient methods to broader utility functions.

Abstract

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle…
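
As a rough illustration of the objective described in the abstract, the Python sketch below computes the discounted state-action occupancy measure of a fixed policy in a small tabular MDP and evaluates two utilities of it. One is linear, which recovers the usual cumulative reward. The other is an entropy utility, which is concave and is one common way to express an exploration goal. The function names, the array layout, and the choice of entropy as the example utility are assumptions made for this illustration and are not taken from the paper; the sketch does not implement the variational policy gradient algorithm itself.

```python
import numpy as np


def occupancy_measure(P, pi, mu0, gamma):
    """Discounted occupancy lam[s, a] = sum_t gamma^t * Pr(s_t = s, a_t = a).

    P:   transitions, shape (S, A, S), with P[s, a, t] = Pr(next = t | s, a)
    pi:  policy, shape (S, A), each row sums to 1
    mu0: initial state distribution, shape (S,)
    (This array layout is an assumption of the sketch.)
    """
    S = P.shape[0]
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Discounted state visitation d solves d = mu0 + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d[:, None] * pi


def cumulative_reward_utility(lam, r):
    """Linear utility <lam, r>: the standard expected discounted return."""
    return float(np.sum(lam * r))


def entropy_utility(lam, gamma):
    """Concave utility: entropy of the normalized occupancy measure."""
    p = (1.0 - gamma) * lam  # normalized so that the entries sum to 1
    p = p[p > 0]             # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))


# Hypothetical 2-state, 2-action MDP used only to exercise the functions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.5, 0.5], [0.3, 0.7]])
mu0 = np.array([1.0, 0.0])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

lam = occupancy_measure(P, pi, mu0, gamma)
print("occupancy measure:\n", lam)
print("cumulative reward:", cumulative_reward_utility(lam, r))
print("entropy utility:  ", entropy_utility(lam, gamma))
```

With the linear utility, the value of the objective coincides with the expected discounted return obtained from the Bellman equation for the fixed policy. The entropy utility has no such additive reward structure, which is the kind of objective the abstract refers to when it says the Bellman equation no longer applies.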

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Traffic control and management