Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs

Michael Lu; Matin Aghaei; Anant Raj; Sharan Vaswani

arXiv:2405.13136 · cs.LG · October 1, 2024

TL;DR

This paper develops practical policy gradient algorithms for bandits and tabular MDPs that do not rely on unknown problem-specific parameters, achieving theoretical guarantees and competitive empirical performance.

Contribution

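It introduces principled PG methods that set the step-size with an Armijo line-search in the exact setting and with exponentially decreasing step-sizes in the stochastic setting. Neither requires oracle knowledge of problem-dependent quantities, and both come with proven convergence rates.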

Findings

01. Linear convergence in the exact setting with Armijo line-search.

02. Convergence guarantees in the stochastic setting with exponentially decreasing step-sizes.

03. Competitive empirical performance without oracle knowledge.
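
To make the first finding concrete, here is a minimal sketch of softmax PG with a generic backtracking Armijo line-search on a bandit with known mean rewards (the exact setting). It is an illustration under stated assumptions, not the authors' implementation: the function names, the constants `eta_max`, `c`, `beta`, and `eta_min`, and the choice to restart the search from `eta_max` at every iteration are all hypothetical, and the paper's variant may differ in its details.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, r):
    """Exact bandit objective f(theta) = <softmax(theta), r>."""
    return sum(p * ra for p, ra in zip(softmax(theta), r))


def gradient(theta, r):
    """Exact gradient: df/dtheta_a = pi_a * (r_a - <pi, r>)."""
    pi = softmax(theta)
    f = sum(p * ra for p, ra in zip(pi, r))
    return [p * (ra - f) for p, ra in zip(pi, r)]


def armijo_softmax_pg(r, iters=200, eta_max=1e3, c=0.5, beta=0.5, eta_min=1e-12):
    """Gradient ascent on f with a backtracking Armijo line-search.

    All constants are illustrative assumptions, not values from the paper.
    """
    theta = [0.0] * len(r)  # start from the uniform policy
    for _ in range(iters):
        g = gradient(theta, r)
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:
            break
        f = expected_reward(theta, r)
        eta = eta_max
        cand = theta
        while eta > eta_min:
            cand = [t + eta * gi for t, gi in zip(theta, g)]
            # Armijo (sufficient increase) condition for maximization.
            if expected_reward(cand, r) >= f + c * eta * g_sq:
                break
            eta *= beta  # backtrack
        else:
            break  # no acceptable step-size found; stop
        theta = cand
    return theta
```

For example, `softmax(armijo_softmax_pg([0.2, 0.5, 0.9]))` returns a policy that concentrates on the third arm. The search only evaluates the objective and its gradient, so it needs neither the smoothness constant nor the identity of the optimal arm.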

Abstract

We consider (stochastic) softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs). While the PG objective is non-concave, recent research has used the objective's smoothness and gradient domination properties to achieve convergence to an optimal policy. However, these theoretical results require setting the algorithm parameters according to unknown problem-dependent quantities (e.g. the optimal action or the true reward vector in a bandit problem). To address this issue, we borrow ideas from the optimization literature to design practical, principled PG methods in both the exact and stochastic settings. In the exact setting, we employ an Armijo line-search to set the step-size for softmax PG and demonstrate a linear convergence rate. In the stochastic setting, we utilize exponentially decreasing step-sizes, and characterize the convergence rate of…
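
For the stochastic setting described in the abstract, the following sketch runs softmax PG on a Bernoulli bandit with an exponentially decreasing step-size. Several parts are assumptions rather than details taken from the paper: the schedule `eta_t = eta0 * alpha**t` with `alpha = (beta / T) ** (1 / T)` is one common form from the optimization literature, the gradient estimate is the standard on-policy REINFORCE estimator `reward * (e_a - pi)`, and the function names and default constants are hypothetical.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def exp_step_sizes(T, eta0=1.0, beta=1.0):
    """Assumed schedule: eta_t = eta0 * alpha**t, alpha = (beta / T)**(1 / T)."""
    alpha = (beta / T) ** (1.0 / T)
    return [eta0 * alpha ** t for t in range(T)]


def stochastic_softmax_pg(mean_rewards, T=5000, eta0=1.0, beta=1.0, seed=0):
    """Stochastic softmax PG on a Bernoulli bandit (illustrative sketch).

    The learner never reads mean_rewards directly; it only observes the
    sampled 0/1 reward of the arm it pulls.
    """
    rng = random.Random(seed)
    k = len(mean_rewards)
    theta = [0.0] * k
    for eta in exp_step_sizes(T, eta0, beta):
        pi = softmax(theta)
        a = rng.choices(range(k), weights=pi)[0]  # sample an arm from pi
        reward = 1.0 if rng.random() < mean_rewards[a] else 0.0
        # REINFORCE estimate of the gradient: reward * (e_a - pi).
        for i in range(k):
            indicator = 1.0 if i == a else 0.0
            theta[i] += eta * reward * (indicator - pi[i])
    return softmax(theta)
```

Calling `stochastic_softmax_pg([0.2, 0.5, 0.9])` returns the final policy as a list of probabilities. The schedule depends only on the horizon `T` and the chosen `eta0` and `beta`, not on the reward gaps or the optimal arm.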

Code & Models

Repositories

sudo-michael/practical-pg (JAX, official)

Taxonomy

Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management

Methods: Sparse Evolutionary Training · Softmax