Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients
Alex DeWeese, Guannan Qu

TL;DR
This paper introduces a $k$-step policy gradient method that overcomes the myopic limitations of standard policy gradients, enabling convergence to near-optimal policies in restricted classes with theoretical guarantees.
Contribution
It proposes a generalized $k$-step policy gradient method for restricted policy classes, with a performance guarantee that tightens exponentially in $k$ and convergence results for projected gradient descent and mirror descent.
Findings
The $k$-step policy gradient can escape local optima in MDPs.
The method is guaranteed to converge to a policy whose performance is exponentially close, in $k$, to that of the optimal deterministic policy.
It achieves these guarantees with fewer issues related to distribution mismatch.
Abstract
This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e., it only improves the policy based on the one-step $Q$-function. In this work, we propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in iterations, despite only…
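For context, the "one-step $Q$-function" dependence mentioned in the abstract can be seen in the standard policy gradient theorem for discounted MDPs. The notation below is an assumption for illustration and is not taken from the paper: $\gamma$ is the discount factor, $\rho$ the initial state distribution, $d_\rho^{\pi_\theta}$ the normalized discounted state-visitation distribution, and $J(\theta) = \mathbb{E}_{s_0 \sim \rho}[V^{\pi_\theta}(s_0)]$.

$$
\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_\rho^{\pi_\theta}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
$$

Here $Q^{\pi_\theta}(s, a)$ scores a change to a single action at $s$ while assuming the current policy is followed afterward. This is one way to read the abstract's description of the standard gradient as myopic.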
