Fast Convergence of Softmax Policy Mirror Ascent
Reza Asad, Reza Babanezhad, Issam Laradji, Nicolas Le Roux, Sharan Vaswani

TL;DR
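This paper introduces SPMA, a refinement of a policy gradient method that performs mirror ascent in the dual space of logits. In tabular MDPs, SPMA with a constant step-size matches the linear convergence of NPG and converges faster than constant step-size softmax policy gradient. The method is also extended to large state-action spaces and evaluated on MuJoCo and Atari benchmarks.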
Contribution
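It refines the policy gradient method of Vaswani et al. [2021] into SPMA by removing the need for normalization across actions, analyzes the convergence of the resulting algorithm, and extends it to large state-action spaces with a log-linear policy parameterization that does not require compatible function approximation.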
Findings
With a constant step-size, SPMA matches NPG's linear convergence in tabular MDPs.
SPMA converges faster than constant step-size softmax policy gradient, including its accelerated variant.
Empirical results show SPMA's competitive performance on MuJoCo and Atari.
Abstract
Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with…
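To make the contrast in the abstract concrete, here is a minimal sketch on a hypothetical one-state problem (a three-armed bandit). The first function is the standard view of NPG as mirror ascent in probability space: a multiplicative-weights step followed by renormalization. The second is one possible normalization-free update, assumed here to take the form pi(a) * (1 + eta * A(a)); this summary does not state the exact SPMA update, so treat that form, the reward values, and the step-size as illustrative assumptions rather than the paper's algorithm.

```python
import math

def npg_step(pi, q, eta):
    """NPG as mirror ascent on probabilities: reweight by exp(eta * q), then renormalize."""
    weights = [p * math.exp(eta * qa) for p, qa in zip(pi, q)]
    z = sum(weights)  # explicit normalization across actions
    return [w / z for w in weights]

def normalization_free_step(pi, q, eta):
    """Assumed form (not confirmed by the summary): pi(a) * (1 + eta * A(a)).

    A(a) = q(a) - sum_b pi(b) q(b) is the advantage. Because
    sum_a pi(a) * A(a) = 0, the new vector sums to 1 with no division.
    Positivity needs eta * |A(a)| < 1, which holds below since rewards
    lie in [0, 1] and eta = 0.5.
    """
    value = sum(p * qa for p, qa in zip(pi, q))
    return [p * (1.0 + eta * (qa - value)) for p, qa in zip(pi, q)]

# Hypothetical bandit: one state, three actions, rewards in [0, 1].
rewards = [1.0, 0.5, 0.0]
eta = 0.5

pi_npg = [1.0 / 3.0] * 3
pi_nf = [1.0 / 3.0] * 3
for _ in range(50):
    pi_npg = npg_step(pi_npg, rewards, eta)
    pi_nf = normalization_free_step(pi_nf, rewards, eta)

print("NPG policy:               ", [round(p, 6) for p in pi_npg])
print("Normalization-free policy:", [round(p, 6) for p in pi_nf])
```

In this toy setting both updates concentrate on the best action, and the second one stays on the probability simplex without ever computing a normalizer. That is the property the abstract refers to when it mentions removing the normalization across actions.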
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning
Methods: Entropy Regularization · Proximal Policy Optimization · Softmax · Mirror Descent Policy Optimization · Trust Region Policy Optimization · Feedback Alignment
