Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs
Michael Lu, Matin Aghaei, Anant Raj, Sharan Vaswani

TL;DR
This paper develops practical policy gradient algorithms for bandits and tabular MDPs that do not rely on unknown problem-specific parameters, achieving theoretical guarantees and competitive empirical performance.
Contribution
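The paper introduces principled policy gradient (PG) methods that set their step-sizes with an Armijo line-search in the exact setting and with exponentially decreasing step-sizes in the stochastic setting. Neither requires oracle knowledge of problem-dependent quantities, and both come with proven convergence rates.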
Findings
Linear convergence in the exact setting with Armijo line-search.
Convergence guarantees in the stochastic setting with decreasing step-sizes.
Competitive empirical performance without oracle knowledge.
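The first finding can be illustrated with a minimal sketch of exact softmax PG on a bandit, where a backtracking Armijo line-search picks the step-size at every iteration. This is an illustration under stated assumptions, not the authors' implementation: the function names, the reward vector, and the constants `eta_max`, `c`, and `beta` are all hypothetical choices, and the step-size is simply reset to `eta_max` before each search.

```python
import math

def softmax(theta):
    # Numerically stable softmax over action logits.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(theta, r):
    # Bandit objective: expected reward under the softmax policy.
    return sum(p * ri for p, ri in zip(softmax(theta), r))

def gradient(theta, r):
    # Exact softmax PG gradient: pi_a * (r_a - <pi, r>).
    pi = softmax(theta)
    avg = sum(p * ri for p, ri in zip(pi, r))
    return [p * (ri - avg) for p, ri in zip(pi, r)]

def armijo_step(theta, r, eta_max=1000.0, c=0.5, beta=0.5):
    # Backtracking search (assumed constants): shrink eta until the
    # Armijo sufficient-increase condition holds for this ascent step.
    g = gradient(theta, r)
    g_sq = sum(x * x for x in g)
    f0 = expected_reward(theta, r)
    eta = eta_max
    while eta > 1e-12:
        cand = [t + eta * x for t, x in zip(theta, g)]
        if expected_reward(cand, r) >= f0 + c * eta * g_sq:
            return cand, eta
        eta *= beta
    return theta, 0.0  # no acceptable step found; keep the iterate

def run_exact_pg(r, iters=200):
    # Start from the uniform policy and record the objective per iteration.
    theta = [0.0] * len(r)
    values = [expected_reward(theta, r)]
    for _ in range(iters):
        theta, _ = armijo_step(theta, r)
        values.append(expected_reward(theta, r))
    return theta, values

theta, values = run_exact_pg([0.2, 0.5, 0.9])
```

The rewards `r` are used only to evaluate the objective and its gradient, which the exact setting assumes are available. The step-size itself is never set from the optimal action or the reward gap.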
Abstract
We consider (stochastic) softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs). While the PG objective is non-concave, recent research has used the objective's smoothness and gradient domination properties to achieve convergence to an optimal policy. However, these theoretical results require setting the algorithm parameters according to unknown problem-dependent quantities (e.g. the optimal action or the true reward vector in a bandit problem). To address this issue, we borrow ideas from the optimization literature to design practical, principled PG methods in both the exact and stochastic settings. In the exact setting, we employ an Armijo line-search to set the step-size for softmax PG and demonstrate a linear convergence rate. In the stochastic setting, we utilize exponentially decreasing step-sizes, and characterize the convergence rate of…
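For the stochastic setting, the following is a rough sketch of softmax PG on a Bernoulli bandit with an exponentially decreasing step-size. The abstract does not give the schedule's exact form, so the parameterization `eta_t = eta0 * alpha**t` with `alpha = (beta / T) ** (1 / T)` is an assumption borrowed from the stochastic optimization literature; the function names, the Bernoulli reward model, and the constants are likewise hypothetical.

```python
import math
import random

def softmax(theta):
    # Numerically stable softmax over action logits.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def exp_step_sizes(eta0, T, beta=1.0):
    # Assumed schedule: eta_t = eta0 * alpha^t, alpha = (beta / T)^(1 / T),
    # so the final step-size equals eta0 * beta / T.
    alpha = (beta / T) ** (1.0 / T)
    return [eta0 * alpha ** t for t in range(1, T + 1)]

def stochastic_gradient(pi, action, reward):
    # Score-function (REINFORCE-style) estimate of the softmax PG gradient:
    # reward * (indicator(a == action) - pi_a).
    return [reward * ((1.0 if a == action else 0.0) - p)
            for a, p in enumerate(pi)]

def run_stochastic_pg(mean_rewards, T=5000, eta0=1.0, seed=0):
    # The learner only sees sampled rewards; mean_rewards is used
    # solely to simulate the environment.
    rng = random.Random(seed)
    k = len(mean_rewards)
    theta = [0.0] * k
    for eta in exp_step_sizes(eta0, T):
        pi = softmax(theta)
        action = rng.choices(range(k), weights=pi)[0]
        reward = 1.0 if rng.random() < mean_rewards[action] else 0.0
        g = stochastic_gradient(pi, action, reward)
        theta = [t + eta * x for t, x in zip(theta, g)]
    return softmax(theta)

policy = run_stochastic_pg([0.2, 0.5, 0.9])
```

The schedule depends only on the horizon `T` and the initial step-size `eta0`, not on the reward gap or the identity of the optimal action.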
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management
Methods: Sparse Evolutionary Training · Softmax
