On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift
Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan

TL;DR
This paper provides a theoretical analysis of policy gradient methods in reinforcement learning, establishing convergence properties, approximation guarantees, and the impact of distribution shift in large state and action spaces.
Contribution
It gives provable characterizations of the computational, approximation, and sample-size properties of policy gradient methods in discounted Markov Decision Processes, covering both tabular and parametric (log-linear and neural) policy classes.
Findings
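Global convergence to the optimal policy for tabular policy parameterizations, where the optimal policy is contained in the class.
Agnostic learning results for parametric policy classes, which may not contain the optimal policy.
Approximation guarantees that depend on distribution shift rather than on the worst-case size of the state space.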
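To make the tabular finding concrete, here is a minimal sketch of exact policy gradient ascent with a softmax tabular policy. Everything specific in it is an assumption made for illustration and is not taken from the paper: the two-state MDP and its numbers, the uniform start distribution `mu`, the step size `eta`, the iteration count, and all function names. It only shows the expected qualitative behavior on one toy problem, namely that the value rises from its starting point and never exceeds the optimal value computed by value iteration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # P[s, a, s']: transition probabilities
r = np.array([[0.0, 0.1],
              [0.5, 1.0]])                 # r[s, a]: rewards
gamma = 0.5                                # discount factor
mu = np.array([0.5, 0.5])                  # start-state distribution (assumed uniform)


def softmax_policy(theta):
    """Tabular softmax policy: one logit per (state, action) pair."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def evaluate(pi):
    """Exact V, Q, and discounted state visitation d under start distribution mu."""
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)                  # expected one-step reward under pi
    M = np.eye(len(mu)) - gamma * P_pi
    V = np.linalg.solve(M, r_pi)                 # V = (I - gamma * P_pi)^{-1} r_pi
    Q = r + gamma * P @ V                        # Q[s, a]
    d = (1 - gamma) * np.linalg.solve(M.T, mu)   # discounted visitation distribution
    return V, Q, d


def policy_gradient(theta):
    """Exact gradient of V(mu) with respect to the softmax logits, plus V(mu)."""
    pi = softmax_policy(theta)
    V, Q, d = evaluate(pi)
    adv = Q - V[:, None]                         # advantage A[s, a]
    grad = d[:, None] * pi * adv / (1 - gamma)
    return grad, mu @ V


def optimal_value(iters=200):
    """Optimal V(mu) via value iteration, used only as a reference point."""
    V = np.zeros(len(mu))
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return mu @ V


theta = np.zeros_like(r)                         # uniform initial policy
eta = (1 - gamma) ** 3 / 8                       # conservative step size (assumption)
_, initial_value = policy_gradient(theta)
for _ in range(5000):
    grad, _ = policy_gradient(theta)
    theta += eta * grad                          # gradient ascent on V(mu)
_, final_value = policy_gradient(theta)
v_star = optimal_value()
print(f"initial {initial_value:.4f}  final {final_value:.4f}  optimal {v_star:.4f}")
```

The sketch uses exact gradients from a known model, so it says nothing about sample complexity or function approximation; those settings are where the distribution-shift terms in the findings above come into play.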
Abstract
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy…
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning
