Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies
Kaiqing Zhang, Alec Koppel, Hao Zhu, Tamer Başar

TL;DR
This paper proves that policy gradient methods in reinforcement learning converge globally to (almost) locally optimal policies by analyzing them within a nonconvex optimization framework. It also introduces a new algorithm variant whose practical benefits are demonstrated on the inverted pendulum.
Contribution
It introduces a novel variant of policy gradient methods with unbiased gradient estimates and shows their global convergence to local optima, bridging a key gap in theoretical understanding.
Findings
The new PG variant converges to stationary points with known rates.
Modified PG with enlarged stepsizes escapes saddle points.
Reward reshaping helps avoid saddle points and improves policy quality.
Abstract
Policy gradient (PG) methods are a widely used reinforcement learning methodology in many applications such as video games, autonomous driving, and robotics. In spite of their empirical success, a rigorous understanding of the global convergence of PG methods is lacking in the literature. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte-Carlo estimation of the policy gradient. This method then yields an unbiased estimate of the policy gradient with bounded variance, which enables the tools from nonconvex optimization to be applied to establish global convergence. Employing this perspective, we first recover the convergence results with rates to the stationary-point policies in the literature. More interestingly,…
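The random rollout horizon mentioned in the abstract can be illustrated with a small sketch. It is not the paper's algorithm. It only demonstrates the underlying identity on a toy problem, assuming the horizon T is drawn from a geometric distribution with P(T = t) = (1 - gamma) * gamma^t. Under that assumption P(T >= t) = gamma^t, so the undiscounted sum of rewards up to T has the same expectation as the infinite discounted sum, and a finite rollout gives an unbiased estimate. The names geometric_horizon, random_horizon_return, and toy_reward are hypothetical and chosen for illustration only.

```python
import random


def geometric_horizon(gamma, rng):
    """Sample T in {0, 1, 2, ...} with P(T = t) = (1 - gamma) * gamma**t."""
    t = 0
    # Continue the rollout with probability gamma at every step.
    while rng.random() < gamma:
        t += 1
    return t


def random_horizon_return(reward_fn, gamma, rng):
    """Undiscounted reward sum over a random horizon.

    Since P(T >= t) = gamma**t, the expectation of this finite sum equals
    the infinite discounted sum of gamma**t * reward_fn(t).
    """
    horizon = geometric_horizon(gamma, rng)
    return sum(reward_fn(t) for t in range(horizon + 1))


def toy_reward(t):
    """Hypothetical deterministic reward sequence (an assumption for the demo)."""
    return 1.0 if t % 2 == 0 else 0.5


def estimate_discounted_return(gamma=0.9, num_rollouts=20000, seed=0):
    """Average many random-horizon rollouts to approximate the discounted return."""
    rng = random.Random(seed)
    total = sum(
        random_horizon_return(toy_reward, gamma, rng) for _ in range(num_rollouts)
    )
    return total / num_rollouts


def exact_discounted_return(gamma=0.9, num_terms=2000):
    """Truncated discounted sum used as a reference value."""
    return sum(gamma**t * toy_reward(t) for t in range(num_terms))


if __name__ == "__main__":
    print("random-horizon estimate:", estimate_discounted_return())
    print("reference value:        ", exact_discounted_return())
```

In a policy gradient setting the same idea would be applied to rollouts from the environment rather than to a fixed reward sequence. The point of the sketch is only that each rollout ends after finitely many steps while its expectation still matches the infinite-horizon quantity.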
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
