On the Global Convergence Rates of Softmax Policy Gradient Methods
Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans

TL;DR
This paper analyzes the convergence rates of softmax policy gradient methods in the tabular setting and shows that, with the true gradient, entropy regularization accelerates convergence from a sublinear $O(1/t)$ rate to a linear rate.
Contribution
The paper establishes the first $O(1/t)$ convergence rate for softmax policy gradient with the true gradient and shows that entropy regularization achieves a faster linear rate, which helps explain its empirical benefits.
Findings
Softmax policy gradient with the true gradient converges at an $O(1/t)$ rate.
Entropy-regularized policy gradient converges at a linear rate.
Entropy regularization therefore improves the convergence rate of policy optimization.
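The contrast between the two rates can be observed numerically on a toy problem. The following is a minimal sketch, not the paper's experimental setup: it assumes a one-state problem (a bandit with known rewards), a uniform initial policy, and exact gradients. The reward vector, the step size `eta = 0.2`, the temperature `tau = 0.2`, and all function names are hypothetical choices made for illustration.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def objective(pi, r, tau):
    """Expected reward plus tau times the Shannon entropy of pi."""
    reward = sum(p * ra for p, ra in zip(pi, r))
    entropy = -sum(p * math.log(p) for p in pi if p > 0.0)
    return reward + tau * entropy


def optimal_value(r, tau):
    """max(r) if tau == 0, else the soft maximum tau * log(sum(exp(r / tau)))."""
    m = max(r)
    if tau == 0.0:
        return m
    return m + tau * math.log(sum(math.exp((ra - m) / tau) for ra in r))


def run_pg(r, steps, eta, tau=0.0):
    """Exact-gradient ascent on softmax logits; returns the sub-optimality gap per step."""
    theta = [0.0] * len(r)  # uniform initial policy (an assumption of this sketch)
    best = optimal_value(r, tau)
    gaps = []
    for _ in range(steps):
        pi = softmax(theta)
        gaps.append(best - objective(pi, r, tau))
        # "Soft" reward r(a) - tau * log pi(a); reduces to r(a) when tau == 0.
        soft_r = [ra - tau * math.log(p) if tau > 0.0 else ra for ra, p in zip(r, pi)]
        baseline = sum(p * s for p, s in zip(pi, soft_r))
        # d objective / d theta(a) = pi(a) * (soft_r(a) - baseline)
        theta = [t + eta * p * (s - baseline) for t, p, s in zip(theta, pi, soft_r)]
    return gaps


if __name__ == "__main__":
    r = [1.0, 0.8, 0.6]  # hypothetical rewards
    plain = run_pg(r, steps=2000, eta=0.2, tau=0.0)
    reg = run_pg(r, steps=2000, eta=0.2, tau=0.2)
    for t in (10, 100, 1000, 1999):
        print(t, plain[t], reg[t])
```

Note that the two gaps are measured against different targets: the unregularized gap is relative to the best reward, while the regularized gap is relative to the optimum of the entropy-regularized objective (attained by the softmax optimal policy). In this sketch the first shrinks roughly like $1/t$ and the second geometrically.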
Abstract
We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Łojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$ toward softmax optimal policy $(c > 0)$. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new…
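To make the role of the Łojasiewicz inequality concrete, here is a hedged sketch for the simplest case only: a one-state problem with $K$ actions, reward vector $r$, a unique optimal action $a^*$, and $\pi_\theta = \mathrm{softmax}(\theta)$. The symbols $K$, $r$, $a^*$, $\delta_t$, $c$, and $C$ are notation assumed here for illustration, and the constants are left unspecified. The softmax gradient is

$$\frac{\partial\, \pi_\theta^\top r}{\partial \theta(a)} = \pi_\theta(a)\,\bigl(r(a) - \pi_\theta^\top r\bigr).$$

Bounding the norm of the gradient from below by the magnitude of its $a^*$ component, and using $r(a^*) - \pi_\theta^\top r = (\pi^* - \pi_\theta)^\top r$, gives

$$\left\| \frac{d\, \pi_\theta^\top r}{d\theta} \right\|_2 \;\ge\; \pi_\theta(a^*)\,(\pi^* - \pi_\theta)^\top r.$$

Writing $\delta_t = (\pi^* - \pi_{\theta_t})^\top r$ and assuming $\pi_{\theta_t}(a^*) \ge c > 0$ for all $t$, a standard smoothness argument for gradient ascent with a suitable step size yields a recursion of the form

$$\delta_t - \delta_{t+1} \;\ge\; C\, c^2\, \delta_t^2 \quad \text{for some } C > 0,$$

which implies $\delta_t = O(1/t)$. This is why the abstract pairs the inequality with a lower bound on the probability of an optimal action: the factor $\pi_\theta(a^*)$ makes the inequality non-uniform, so the rate depends on how small that probability can become during optimization.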
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research
Methods: Entropy Regularization · Softmax
