Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits

Denis Denisov; Neil Walton

arXiv:2007.10229·cs.LG·September 24, 2021

Open Access

TL;DR

This paper analyzes a Markov policy gradient algorithm for multi-armed bandits with Bernoulli rewards and shows that, with well-chosen state-dependent learning rates, the algorithm converges to the optimal arm with logarithmic or poly-logarithmic regret.

Contribution

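It analyzes a policy gradient method whose learning rates depend on the current state of the algorithm rather than on a deterministic time-decreasing schedule, and proves convergence to the optimal arm with logarithmic or poly-logarithmic regret in a finite-arm Bernoulli bandit setting.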

Findings

01

With well-chosen learning rates, the algorithm converges to the optimal arm with logarithmic or poly-logarithmic regret

02

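Learning rates depend on the current state of the algorithm rather than on a deterministic time-decreasing schedule, and the state forms a Markov chain on the probability simplex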

03

Markov chain stability is established using Foster-Lyapunov techniques

Abstract

We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than use a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that if learning rates are well chosen then the policy gradient algorithm is a transient Markov chain and the state of the chain converges on the optimal arm with logarithmic or poly-logarithmic regret.
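
To make the setting concrete, here is a minimal Python sketch of a policy-gradient-style bandit update whose step size depends on the current policy rather than on the time step. It is an illustration under stated assumptions, not necessarily the exact update analysed in the paper: the function name `simulate`, the constant `alpha`, the step size `alpha * p[i] ** 2`, and the choice to update relative to the currently most probable ("leading") arm are all assumptions made for this example. The policy `p` depends on the past only through its current value, so it evolves as a Markov chain on the probability simplex.

```python
import random


def simulate(means, alpha=0.1, steps=5000, seed=0):
    """Run a state-dependent policy-gradient-style update on a Bernoulli bandit.

    means: success probability of each arm (hypothetical inputs).
    alpha: constant in (0, 1); the effective step for arm i is alpha * p[i]**2,
           an assumed choice that depends on the state p, not on time.
    Returns the final policy p, a point on the probability simplex.
    """
    rng = random.Random(seed)
    k = len(means)
    p = [1.0 / k] * k  # start from the uniform policy

    for _ in range(steps):
        # Leading arm: the arm with the largest current probability.
        lead = max(range(k), key=lambda i: p[i])
        # Sample an arm from the current policy, then a Bernoulli reward.
        arm = rng.choices(range(k), weights=p)[0]
        reward = 1 if rng.random() < means[arm] else 0

        if reward:
            if arm == lead:
                # Reward on the leading arm shrinks every other arm:
                # p_i <- p_i - alpha * p_i^2 / p_lead  (stays positive,
                # because p_i <= p_lead and alpha < 1).
                for i in range(k):
                    if i != lead:
                        p[i] -= alpha * p[i] ** 2 / p[lead]
            else:
                # Reward on a non-leading arm grows that arm:
                # p_arm <- p_arm + alpha * p_arm^2 * (1 / p_arm).
                p[arm] += alpha * p[arm]

        # The leading arm absorbs the remaining mass so p sums to one.
        p[lead] = 1.0 - sum(p[i] for i in range(k) if i != lead)

    return p


if __name__ == "__main__":
    print(simulate([0.3, 0.7]))
```

With `alpha` in (0, 1), each update keeps every coordinate strictly positive and the total mass equal to one, so the iterate never leaves the simplex. The sketch makes no claim about regret; that is what the paper's analysis addresses.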

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques