A Short Note on Soft-max and Policy Gradients in Bandits Problems
Neil Walton

TL;DR
This paper provides a Lyapunov function-based analysis of softmax policy gradients in bandit problems, deriving regret bounds for differential equation models and discussing stochastic regret bounds.
Contribution
It introduces a Lyapunov function argument for softmax policy gradients in bandits and derives regret bounds for both deterministic and stochastic cases.
Findings
Regret bounds for softmax policy gradient ODEs in bandit problems
Analysis of a different policy gradient algorithm with regret bounds
Discussion on challenges in deriving stochastic regret bounds
Abstract
This is a short communication on a Lyapunov function argument for softmax in bandit problems. There are a number of excellent papers coming out using differential equations for policy gradient algorithms in reinforcement learning \cite{agarwal2019optimality,bhandari2019global,mei2020global}. We give a short argument that gives a regret bound for the soft-max ordinary differential equation for bandit problems. We derive a similar result for a different policy gradient algorithm, again for bandit problems. For this second algorithm, it is possible to prove regret bounds in the stochastic case \cite{DW20}. At the end, we summarize some ideas and issues on deriving stochastic regret bounds for policy gradients.
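As a rough illustration of the kind of dynamics the abstract refers to, the following sketch integrates a softmax policy gradient ODE for a bandit with known mean rewards and accumulates the regret along the trajectory. It assumes the standard form of the flow, d(theta_a)/dt = p_a (r_a - sum_b p_b r_b) with p = softmax(theta), which may differ in detail from the paper's formulation. The function names, the Euler discretisation, the step size and the reward values are all illustrative choices, and the sketch does not reproduce the paper's Lyapunov argument or its bound.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of floats."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_softmax_pg_ode(rewards, horizon=200.0, dt=0.01):
    """Euler-integrate the (assumed) softmax policy gradient flow.

    rewards : known mean reward of each arm (deterministic setting)
    horizon : total integration time
    dt      : Euler step size (a hypothetical choice, not from the paper)

    Returns the final policy and the regret accumulated over [0, horizon],
    i.e. the integral of (best reward - expected reward under the policy).
    """
    theta = [0.0] * len(rewards)  # start from the uniform policy
    best = max(rewards)
    regret = 0.0
    steps = int(round(horizon / dt))
    for _ in range(steps):
        p = softmax(theta)
        value = sum(pi * ri for pi, ri in zip(p, rewards))
        # Instantaneous regret of the current policy, integrated over dt.
        regret += (best - value) * dt
        # Gradient of the expected reward with respect to theta_a
        # under the softmax parameterisation: p_a * (r_a - value).
        theta = [
            th + dt * pi * (ri - value)
            for th, pi, ri in zip(theta, p, rewards)
        ]
    return softmax(theta), regret
```

Calling `simulate_softmax_pg_ode([1.0, 0.5, 0.2])` returns the policy at the end of the horizon together with the accumulated regret. Doubling `horizon` and comparing the two regret values gives a quick empirical check of how slowly regret grows under these assumed dynamics.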
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control
Methods: Softmax
