Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity
Yan Li, Guanghui Lan, Tuo Zhao

TL;DR
This paper introduces homotopic policy mirror descent (HPMD), a new policy gradient method with strong convergence guarantees and improved sample complexity for solving discounted MDPs, extending to stochastic settings and various divergence measures.
Contribution
The paper presents HPMD with global and local convergence guarantees, certifies the limiting policy as optimal with maximal entropy, and extends results to stochastic versions and diverse divergence functions.
Findings
Global linear convergence of HPMD with KL divergence.
Local superlinear convergence without assumptions.
Improved sample complexity under a generative model.
Abstract
We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state and action spaces. HPMD performs a mirror-descent-type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. We first establish the global linear convergence of HPMD instantiated with Kullback-Leibler divergence, for both the optimality gap and a weighted distance to the set of optimal policies. Then local superlinear convergence is obtained for both quantities without any assumption. With local acceleration and diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges…
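The update described in the abstract (a mirror descent step plus a diminishing regularization term) can be sketched for the KL case as follows. This is a minimal illustration, not the authors' implementation. It assumes a cost-minimizing tabular MDP, exact policy evaluation, and a uniform reference policy. The helper names `evaluate_q`, `hpmd_step`, and `run_hpmd`, the toy MDP, and the schedules for `eta` and `tau` are all hypothetical choices made for this example, not values taken from the paper. The closed form in `hpmd_step` follows from minimizing `eta * (<Q, p> + tau * KL(p || pi0)) + KL(p || pi_k)` over the probability simplex.

```python
import numpy as np

def evaluate_q(P, c, pi, gamma):
    """Exact Q^pi for costs c[s, a] and transitions P[s, a, s']."""
    n_states = c.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    c_pi = (pi * c).sum(axis=1)             # expected one-step cost under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    return c + gamma * (P @ v)

def hpmd_step(log_pi, q, log_pi0, eta, tau):
    """One KL-based step with regularization weight tau toward pi0 (log space)."""
    z = (log_pi + eta * tau * log_pi0 - eta * q) / (1.0 + eta * tau)
    z = z - z.max(axis=1, keepdims=True)    # stabilize before exponentiating
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def run_hpmd(P, c, gamma, n_iters):
    n_states, n_actions = c.shape
    log_pi0 = np.full((n_states, n_actions), -np.log(n_actions))  # uniform pi0
    log_pi = log_pi0.copy()
    for k in range(n_iters):
        # Illustrative schedules (an assumption of this sketch):
        # step size grows geometrically while the regularization weight shrinks.
        eta = gamma ** (-(k + 1))
        tau = (1.0 / gamma - 1.0) / eta
        q = evaluate_q(P, c, np.exp(log_pi), gamma)
        log_pi = hpmd_step(log_pi, q, log_pi0, eta, tau)
    return np.exp(log_pi)

# Toy MDP (hypothetical): each state is absorbing, and one action per state is free.
P = np.zeros((2, 2, 2))
P[0, :, 0] = 1.0
P[1, :, 1] = 1.0
c = np.array([[0.0, 1.0], [1.0, 0.0]])
pi = run_hpmd(P, c, gamma=0.9, n_iters=30)
```

Working in log space keeps the iterates well defined even when the probability of a suboptimal action becomes very small, which happens quickly as the step size grows.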
Taxonomy
Topics: Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning
