A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence
Carlo Alfano, Rui Yuan, Patrick Rebeschini

TL;DR
This paper introduces a policy optimization framework based on mirror descent that supports general policy parameterizations, guarantees linear convergence, and improves the sample complexity for shallow neural network policies. The theoretical results are validated empirically on control tasks.
Contribution
It develops a mirror descent-based policy optimization framework that handles general parameterizations and provides the first linear convergence guarantee for a policy-gradient-based method with general parameterization.
Findings
Guarantees linear convergence for general parameterized policies.
Improves sample complexity for shallow neural network policies.
Empirically validates theoretical results on control tasks.
Abstract
Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample…
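As a minimal illustration of how the choice of mirror map determines the policy class, consider the tabular special case, which is much simpler than the general parameterization the paper targets. With the negative-entropy mirror map, a mirror descent step has a well-known closed form: the policy is multiplied by exp(step_size * Q) and renormalized, so the iterates stay in the softmax family. The function name, the fixed toy Q-values, and the step size below are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def pmd_step_negative_entropy(policy, q_values, step_size):
    """One tabular mirror descent step with the negative-entropy mirror map.

    policy:   array of shape (n_states, n_actions); each row is a distribution
              with strictly positive entries.
    q_values: array of the same shape holding action-value estimates
              (assumed to be given; estimating them is out of scope here).
    """
    # Closed form: pi_new(a|s) is proportional to pi(a|s) * exp(step_size * Q(s, a)).
    # Computed in log space, with the row maximum subtracted for stability.
    logits = np.log(policy) + step_size * q_values
    logits -= logits.max(axis=1, keepdims=True)
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)


# Toy usage: 2 states, 3 actions, uniform start, hypothetical fixed Q-values.
policy = np.full((2, 3), 1.0 / 3.0)
q_values = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, 2.0]])
for _ in range(50):
    policy = pmd_step_negative_entropy(policy, q_values, step_size=0.5)
```

In this toy run the Q-values are held fixed, so the policy simply concentrates on the highest-valued action in each state. In an actual policy optimization loop the Q-values would be re-estimated for the current policy at every iteration.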
Taxonomy
Topics: Advanced Memory and Neural Computing · Reinforcement Learning in Robotics · Fuel Cells and Related Materials
Methods: Entropy Regularization · Proximal Policy Optimization · Trust Region Policy Optimization
