A Short Note on Soft-max and Policy Gradients in Bandits Problems

Neil Walton

arXiv:2007.10297 · cs.LG · July 21, 2020 · 1 citation

TL;DR

This paper provides a Lyapunov function-based analysis of softmax policy gradients in bandit problems, deriving regret bounds for differential equation models and discussing stochastic regret bounds.

Contribution

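It gives a Lyapunov function argument for softmax policy gradients in bandit problems, derives regret bounds for the deterministic differential equation models of two policy gradient algorithms, and summarizes ideas and open issues for the stochastic case.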

Findings

01 Regret bounds for softmax policy gradient ODEs in bandit problems

02 Analysis of a different policy gradient algorithm with regret bounds

03 Discussion on challenges in deriving stochastic regret bounds

Abstract

This is a short communication on a Lyapunov function argument for softmax in bandit problems. There are a number of excellent papers coming out using differential equations for policy gradient algorithms in reinforcement learning \cite{agarwal2019optimality,bhandari2019global,mei2020global}. We give a short argument that gives a regret bound for the soft-max ordinary differential equation for bandit problems. We derive a similar result for a different policy gradient algorithm, again for bandit problems. For this second algorithm, it is possible to prove regret bounds in the stochastic case \cite{DW20}. At the end, we summarize some ideas and issues on deriving stochastic regret bounds for policy gradients.
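As a rough illustration of the kind of soft-max differential equation the abstract refers to, the sketch below integrates the standard gradient-flow form d(theta_a)/dt = p_a (r_a - sum_b p_b r_b), with p = softmax(theta), for a bandit with known mean rewards. This form is an assumption here (the abstract does not state the equation), and the function names, step size, horizon, and reward values are all hypothetical choices made for the example. It is a minimal numerical sketch, not the paper's argument.

```python
import math


def softmax(theta):
    """Numerically stable soft-max over a list of preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_softmax_pg_ode(rewards, horizon=200.0, dt=0.01):
    """Forward-Euler integration of the assumed soft-max policy gradient ODE.

    Assumed dynamics (not taken from the paper's text):
        d theta_a / dt = p_a * (r_a - sum_b p_b r_b),  p = softmax(theta).
    Returns the final policy and the accumulated regret, i.e. the
    integral over time of (best mean reward - expected reward).
    """
    theta = [0.0] * len(rewards)  # start from the uniform policy
    best = max(rewards)
    regret = 0.0
    for _ in range(int(horizon / dt)):
        p = softmax(theta)
        value = sum(pa * ra for pa, ra in zip(p, rewards))
        regret += (best - value) * dt
        # Euler step along the gradient of the expected reward
        theta = [t + dt * pa * (ra - value)
                 for t, pa, ra in zip(theta, p, rewards)]
    return softmax(theta), regret


# Hypothetical three-armed bandit with mean rewards 1.0, 0.5 and 0.2.
policy, regret = simulate_softmax_pg_ode([1.0, 0.5, 0.2])
print("final policy:", policy)
print("accumulated regret:", regret)
```

Running the sketch with longer horizons gives a quick numerical feel for how slowly the accumulated regret grows once the policy concentrates on the best arm.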

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control

Methods: Softmax