Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits
Denis Denisov, Neil Walton

TL;DR
This paper analyzes a Markov policy gradient algorithm for finite-arm bandits with Bernoulli rewards and shows that, with well-chosen state-dependent learning rates, it converges to the optimal arm with logarithmic or poly-logarithmic regret.
Contribution
It analyzes a policy gradient method whose learning rates depend on the algorithm's current state, proving convergence to the optimal arm and logarithmic or poly-logarithmic regret in a Bernoulli bandit setting.
Findings
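With well-chosen learning rates, the algorithm converges to the optimal arm with logarithmic or poly-logarithmic regret
State-dependent learning rates make the algorithm's state a Markov chain on the probability simplex
Foster-Lyapunov techniques are used to show that this Markov chain is transient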
Abstract
We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than use a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that if learning rates are well chosen then the policy gradient algorithm is a transient Markov chain and the state of the chain converges on the optimal arm with logarithmic or poly-logarithmic regret.
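To make the setting concrete, the following is a minimal sketch of a policy-gradient-style update on the probability simplex for a Bernoulli bandit. It is not the paper's exact update rule. The update `p <- p + alpha * reward * (e_arm - p)`, the function names `step_size` and `run_bandit`, and the learning rate `c * (1 - max(p))` are all illustrative assumptions. The point is only that the step size is a function of the current state `p` rather than of the time index, so the next state depends on the current state and fresh randomness alone, which makes the sequence of policies a Markov chain on the simplex.

```python
import random


def step_size(p, c=0.5):
    """Hypothetical state-dependent learning rate (an assumption, not the
    paper's choice): it shrinks as the policy concentrates on one arm."""
    return c * (1.0 - max(p))


def run_bandit(means, horizon, seed=0, c=0.5):
    """Simulate an illustrative simplex update on a Bernoulli bandit.

    means   -- success probability of each arm
    horizon -- number of rounds to play
    Returns the final policy p and the cumulative expected regret.
    """
    rng = random.Random(seed)
    k = len(means)
    p = [1.0 / k] * k          # start from the uniform policy
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Sample an arm from the current policy, then a Bernoulli reward.
        arm = rng.choices(range(k), weights=p)[0]
        reward = 1 if rng.random() < means[arm] else 0
        regret += best - means[arm]

        # The learning rate depends only on the current state p.
        alpha = step_size(p, c)

        # p <- p + alpha * reward * (e_arm - p); stays on the simplex
        # because alpha * reward lies in [0, 1).
        p = [(1.0 - alpha * reward) * pj for pj in p]
        p[arm] += alpha * reward
    return p, regret


if __name__ == "__main__":
    policy, total_regret = run_bandit([0.2, 0.5, 0.8], horizon=2000, seed=1)
    print(policy, total_regret)
```

Under this sketch the expected change in `p[j]` is proportional to `p[j] * (means[j] - sum(p[i] * means[i]))`, so mass drifts toward arms with above-average reward. Whether a particular learning rate yields the regret guarantees described in the abstract depends on the conditions established in the paper, which this example does not reproduce.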
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques
