On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Alekh Agarwal; Sham M. Kakade; Jason D. Lee; Gaurav Mahajan

arXiv:1908.00261 · cs.LG · October 16, 2020 · 111 cites

Open Access

TL;DR

This paper provides a theoretical analysis of policy gradient methods in reinforcement learning, establishing convergence properties, approximation guarantees, and the impact of distribution shift in large state and action spaces.

Contribution

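The paper gives provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in discounted Markov Decision Processes, covering both tabular policy parameterizations and parametric (log-linear and neural) policy classes.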

Findings

01. Global convergence for tabular policies to the optimal policy.

02. Agnostic learning results for parametric policy classes.

03. Approximation guarantees that depend on distribution shift, not worst-case state space size.

Abstract

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy…
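
To make the tabular setting concrete, the following is a minimal sketch of exact gradient ascent on a softmax tabular policy in a tiny discounted MDP. It is an illustration under stated assumptions, not the paper's algorithm or experiments: the two-state MDP, the uniform start distribution, the step size, and the iteration count are all hypothetical choices, and every function name is invented here. The update uses the standard softmax policy gradient expression, in which the gradient for entry (s, a) is d(s) * pi(a|s) * A(s, a) / (1 - gamma), with d the discounted state visitation distribution and A the advantage.

```python
import math

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP with deterministic transitions.
NEXT = [[0, 1], [0, 1]]              # NEXT[s][a] is the next state
REWARD = [[0.0, 0.0], [0.0, 1.0]]    # REWARD[s][a]; only (s=1, a=1) pays
MU = [0.5, 0.5]                      # assumed start distribution, full support


def softmax_policy(theta):
    """Turn a table of logits theta[s][a] into probabilities pi[s][a]."""
    policy = []
    for row in theta:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        policy.append([e / total for e in exps])
    return policy


def evaluate(policy, sweeps=200):
    """Iterative policy evaluation; returns state values V and action values Q."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [sum(policy[s][a] * (REWARD[s][a] + GAMMA * v[NEXT[s][a]])
                 for a in range(2)) for s in range(2)]
    q = [[REWARD[s][a] + GAMMA * v[NEXT[s][a]] for a in range(2)]
         for s in range(2)]
    return v, q


def visitation(policy, steps=200):
    """Discounted state visitation distribution from MU (truncated sum)."""
    p, d, discount = list(MU), [0.0, 0.0], 1.0
    for _ in range(steps):
        for s in range(2):
            d[s] += (1 - GAMMA) * discount * p[s]
        nxt = [0.0, 0.0]
        for s in range(2):
            for a in range(2):
                nxt[NEXT[s][a]] += p[s] * policy[s][a]
        p, discount = nxt, discount * GAMMA
    return d


def policy_gradient_step(theta, lr):
    """One exact gradient ascent step on V(MU) for the softmax table."""
    policy = softmax_policy(theta)
    v, q = evaluate(policy)
    d = visitation(policy)
    return [[theta[s][a]
             + lr * d[s] * policy[s][a] * (q[s][a] - v[s]) / (1 - GAMMA)
             for a in range(2)] for s in range(2)]


def run(iterations=1000, lr=1.0):
    """Start from the uniform policy and return the final policy and values."""
    theta = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iterations):
        theta = policy_gradient_step(theta, lr)
    policy = softmax_policy(theta)
    v, _ = evaluate(policy)
    return policy, v


if __name__ == "__main__":
    final_policy, final_values = run()
    print(final_policy, final_values)
```

In this toy MDP the optimal policy takes action 1 in both states, with optimal values 9 and 10 for states 0 and 1. The learned policy approaches it only gradually, because the gradient shrinks as the probability of the suboptimal action shrinks. The step size here was chosen for brevity and should not be read as one the paper's analysis prescribes.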

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning