Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action
Xin Chen, Yifan Hu, Minda Zhao

TL;DR
This paper establishes a theoretical foundation for policy gradient methods in finite-horizon MDPs with general state and action spaces, proving global convergence under a Polyak-{\L}ojasiewicz-Kurdyka (P{\L}K) condition and providing sample complexity guarantees.
Contribution
It introduces the P{\L}K condition for finite-horizon MDPs, enabling the first sample complexity guarantees for complex stochastic control models.
Findings
Policy gradient methods converge globally under the P{\L}K condition.
Sample complexity is $\tilde{O}(\frac{1}{\epsilon})$ for $\epsilon$-optimal policies.
Numerical experiments show superior performance over benchmark algorithms.
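To give some intuition for why a Polyak-{\L}ojasiewicz-type inequality lets gradient methods reach a global optimum despite nonconvexity, here is a minimal sketch on a one-dimensional toy function. It is not the paper's P{\L}K condition for finite-horizon MDPs, and nothing in it comes from the paper: the function $f(x) = x^2 + 3\sin^2(x)$, the constant `MU = 1/32`, the step size, and all function names are assumptions chosen for illustration. The sketch numerically checks the classical inequality $\frac{1}{2}|f'(x)|^2 \ge \mu\,(f(x) - f^*)$ with $f^* = 0$ on a grid, then runs plain gradient descent.

```python
import math

# Assumed PL constant for this toy function (illustrative, not from the paper).
MU = 1.0 / 32.0


def f(x):
    """Toy nonconvex objective with global minimum f(0) = 0."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad_f(x):
    """Derivative of f: 2x + 6 sin(x) cos(x) = 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def second_derivative(x):
    """Second derivative of f; it is negative wherever cos(2x) < -1/3."""
    return 2.0 + 6.0 * math.cos(2.0 * x)


def pl_holds_on_grid(lo=-10.0, hi=10.0, n=2001):
    """Check 0.5 * f'(x)^2 >= MU * (f(x) - 0) at n evenly spaced grid points."""
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        if 0.5 * grad_f(x) ** 2 < MU * f(x) - 1e-12:
            return False
    return True


def gradient_descent(x0, step=0.1, iters=200):
    """Plain gradient descent; step is below 1/8, and 8 bounds |f''|."""
    x = x0
    for _ in range(iters):
        x -= step * grad_f(x)
    return x


if __name__ == "__main__":
    print("nonconvex at x = 1.5:", second_derivative(1.5) < 0)
    print("PL inequality holds on grid:", pl_holds_on_grid())
    x_final = gradient_descent(3.0)
    print("final iterate:", x_final, "objective:", f(x_final))
```

The function is not convex (its second derivative changes sign), yet its only stationary point is the global minimizer, so gradient descent started at `x0 = 3.0` is not trapped anywhere else. The grid check is only numerical evidence on a bounded interval, not a proof.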
Abstract
Policy gradient methods are widely used in reinforcement learning. Yet, the nonconvexity of policy optimization poses significant challenges in understanding the global convergence of policy gradient methods. For a class of finite-horizon Markov Decision Processes (MDPs) with general state and action spaces, we identify a set of structural properties to establish a benign nonconvex landscape, the Polyak-{\L}ojasiewicz-Kurdyka (P{\L}K) condition of the policy optimization. Leveraging the P{\L}K condition, policy gradient methods converge to the globally optimal policy with a non-asymptotic rate despite nonconvexity. Our results apply to various control and operations models, including entropy-regularized tabular MDPs, Linear Quadratic Regulator problems, and both stochastic inventory models and stochastic cash balance problems with strongly convex costs. In these models, stochastic…
Taxonomy
Topics: Auction Theory and Applications
Methods: Sparse Evolutionary Training
