Fast Convergence of Softmax Policy Mirror Ascent

Reza Asad; Reza Babanezhad; Issam Laradji; Nicolas Le Roux; Sharan Vaswani

arXiv:2411.12042·cs.LG·June 2, 2025

TL;DR

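This paper introduces SPMA, a refined policy gradient method that, in tabular MDPs, matches the linear convergence of NPG and converges faster than constant step-size softmax policy gradient. It extends SPMA to large state-action spaces with a log-linear policy parameterization and reports competitive results on MuJoCo and Atari.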

Contribution

The paper refines the mirror-ascent policy gradient method of Vaswani et al. [2021] by removing its normalization across actions, analyzes the resulting algorithm (SPMA), and extends it to linear function approximation without requiring compatible function approximation.

Findings

01. SPMA with a constant step-size matches NPG's linear convergence in tabular MDPs.

02. SPMA converges faster than constant step-size softmax policy gradient, including its accelerated variant.

03. SPMA performs competitively on MuJoCo and Atari benchmarks.

Abstract

Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions, and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with…
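To make the phrase "removing its need for a normalization across actions" concrete, here is a minimal sketch on a one-state MDP (a bandit) with known rewards. The NPG step is the standard multiplicative-weights form, which renormalizes over actions at every iteration. The second step, `linearized_step`, is an assumption: this page does not state the SPMA update, and the form used here, pi * (1 + eta * advantage), is only an illustration of how an update can stay on the simplex without normalizing. The function names, rewards, and step-size are all hypothetical.

```python
import numpy as np

def npg_step(pi, r, eta):
    """Softmax NPG on a bandit: multiplicative weights with explicit normalization."""
    w = pi * np.exp(eta * r)
    return w / w.sum()  # normalization across actions

def linearized_step(pi, r, eta):
    """Assumed normalization-free update (illustrative, not confirmed by this page).

    The advantage has zero mean under pi, so the new probabilities already
    sum to one. Positivity needs eta * |advantage| < 1.
    """
    adv = r - pi @ r  # advantage of each action under the current policy
    return pi * (1.0 + eta * adv)

def run(step, r, eta, iters):
    """Iterate a policy update starting from the uniform policy."""
    pi = np.full(len(r), 1.0 / len(r))
    for _ in range(iters):
        pi = step(pi, r, eta)
    return pi

# Hypothetical rewards in [0, 1]; action 0 is optimal.
rewards = np.array([1.0, 0.8, 0.2])
pi_npg = run(npg_step, rewards, eta=0.5, iters=200)
pi_lin = run(linearized_step, rewards, eta=0.5, iters=200)
```

With rewards in [0, 1] and eta = 0.5, the factor 1 + eta * advantage stays at or above 0.5, so the sketch never produces a negative probability. Both runs concentrate on action 0; the sketch says nothing about the convergence rates proved in the paper.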

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning

Methods: Entropy Regularization · Proximal Policy Optimization · Softmax · Mirror Descent Policy Optimization · Trust Region Policy Optimization · Feedback Alignment