Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

Alex DeWeese; Guannan Qu

arXiv:2605.10909·cs.LG·May 12, 2026

TL;DR

This paper introduces a $k$-step policy gradient method that overcomes the myopia of standard policy gradients on restricted policy classes. The method provably converges to a policy whose performance is exponentially close, in $k$, to that of the optimal deterministic policy.

Contribution

It proposes a generalized $k$-step policy gradient that couples the randomness within a $k$-step time window, with a guarantee that the performance gap to the optimal deterministic policy shrinks exponentially in $k$ for restricted policy classes.

Findings

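1. The $k$-step policy gradient can escape the myopic local optima that trap standard policy gradients in MDPs with restricted policy classes.

2. The method is guaranteed to converge to a policy whose performance is exponentially close, in $k$, to that of the optimal deterministic policy.

3. It achieves these guarantees with fewer issues related to distribution mismatch.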

Abstract

This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e., it only improves the policy based on the one-step $Q$-function. In this work, we propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(\frac{1}{T})$ iterations, despite only…
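
For context on the word "myopic", the standard policy gradient theorem can be written as below. The notation is assumed for illustration and may differ from the paper's: $\gamma$ is the discount factor, $d^{\pi_\theta}$ the normalized discounted state-visitation distribution, $r$ the reward, and $P$ the transition kernel. The paper's $k$-step estimator is not reproduced here.

$$
\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right],
\qquad
Q^{\pi_\theta}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ V^{\pi_\theta}(s') \right].
$$

Each term scores a deviation in the single action $a$ taken at state $s$, with every later action still drawn from $\pi_\theta$. This is the one-step sense in which the abstract calls the gradient myopic.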
