Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods
Sara Klein, Simon Weissmann, Leif Döring

TL;DR
This paper introduces a dynamic policy gradient method for finite-horizon MDPs that trains parameters backwards in time, yielding improved convergence bounds and exploiting the problem structure better than training all parameters simultaneously.
Contribution
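It proposes a novel dynamic policy gradient approach that trains parameters backwards in time in finite-horizon MDPs, and provides convergence analysis for both simultaneous and dynamic policy gradient under the tabular softmax parametrisation, in the exact and sampled gradient settings.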
Findings
Dynamic policy gradient outperforms standard methods in convergence speed.
The approach better exploits the structure of finite-time horizon problems.
Convergence bounds are improved using dynamic training.
Abstract
Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems. In finite-time horizons, such problems are relevant, for instance, for optimal stopping or specific supply chain problems, but also in the training of large language models. In contrast to infinite-horizon MDPs, optimal policies are not stationary; policies must be learned for every single epoch. In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming. This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time. For the tabular softmax parametrisation we carry out the convergence analysis for simultaneous and dynamic policy gradient towards global optima, both in the exact and sampled gradient settings…
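To make the idea of training backwards in time concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small tabular MDP given by hypothetical arrays `P` (transition probabilities) and `r` (rewards), exact gradients rather than sampled ones, a uniform state weighting `mu`, and an arbitrary step size and iteration count. The function names and the toy MDP are illustrative only. Each epoch gets its own softmax parameters, and epoch h is trained only after the policies for the later epochs are fixed.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_policy_gradient(P, r, horizon, steps=2000, lr=1.0):
    """Train one tabular softmax policy per epoch, from the last epoch to the first.

    P[s][a][s2] is the transition probability and r[s][a] the reward
    (both hypothetical inputs). Returns (theta, values at epoch 0).
    """
    n_states, n_actions = len(r), len(r[0])
    mu = [1.0 / n_states] * n_states  # assumption: uniform state weighting
    theta = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(horizon)]
    v_next = [0.0] * n_states  # nothing is earned after the final epoch
    for h in reversed(range(horizon)):
        # Q-values at epoch h under the already trained policies of later epochs.
        q = [[r[s][a] + sum(P[s][a][s2] * v_next[s2] for s2 in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        for _ in range(steps):
            for s in range(n_states):
                pi = softmax(theta[h][s])
                v = sum(pi[a] * q[s][a] for a in range(n_actions))
                # Exact softmax policy gradient: mu(s) * pi(a|s) * (Q(s,a) - V(s)).
                for a in range(n_actions):
                    theta[h][s][a] += lr * mu[s] * pi[a] * (q[s][a] - v)
        v_next = [sum(p * qa for p, qa in zip(softmax(theta[h][s]), q[s]))
                  for s in range(n_states)]
    return theta, v_next

def optimal_values(P, r, horizon):
    """Backward induction (dynamic programming), used here only as a reference."""
    n_states, n_actions = len(r), len(r[0])
    v = [0.0] * n_states
    for _ in range(horizon):
        v = [max(r[s][a] + sum(P[s][a][s2] * v[s2] for s2 in range(n_states))
                 for a in range(n_actions)) for s in range(n_states)]
    return v

# Toy two-state, two-action MDP (illustrative): action 0 moves to state 0,
# action 1 moves to state 1; the largest reward is for staying in state 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.1, 0.0], [0.0, 1.0]]

theta, v_dpg = dynamic_policy_gradient(P, r, horizon=3)
v_opt = optimal_values(P, r, horizon=3)
print("dynamic PG values:", v_dpg)
print("optimal values:   ", v_opt)
```

In this sketch, each backward step is a separate one-epoch optimisation problem whose target already accounts for the later policies. That is the structure suggested by dynamic programming which simultaneous training ignores. A sampled-gradient variant would replace the exact Q-values with rollout estimates.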
Taxonomy
Topics: Machine Learning and Algorithms · Optimization and Search Problems · Reinforcement Learning in Robotics
Methods: Softmax
