A Diffusion Analysis of Policy Gradient for Stochastic Bandits
Tor Lattimore

TL;DR
This paper analyzes policy gradient for k-armed stochastic bandits through a continuous-time diffusion approximation, proving a regret bound for sufficiently small learning rates and showing that regret can be linear when the learning rate is too large relative to the minimum gap.
Contribution
It studies a diffusion approximation of policy gradient in stochastic bandits, derives a regret upper bound in terms of the learning rate η, and shows how large η may be relative to the minimum gap Δ before regret becomes linear.
Findings
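Regret is O(k log(k) log(n) / η) when the learning rate satisfies η = O(Δ^2 / log(n)), where k is the number of arms, n the horizon, and Δ the minimum gap.
A constructed instance with only logarithmically many arms has linear regret unless η = O(Δ^2).
Together, these results show that a learning rate on the order of Δ^2 is necessary in the worst case, and that Δ^2 / log(n) suffices for the stated upper bound.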
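To make the role of the learning rate concrete, the following is a minimal simulation sketch of discrete-time softmax policy gradient on a k-armed bandit. It is not taken from the paper, which analyzes a continuous-time diffusion approximation; the update rule shown (the standard score-function update without a baseline), the unit-variance Gaussian rewards, and the names `softmax` and `policy_gradient_regret` are all assumptions made for illustration.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    weights = [math.exp(t - m) for t in theta]
    total = sum(weights)
    return [w / total for w in weights]


def policy_gradient_regret(means, eta, n, seed=0):
    """Run softmax policy gradient for n rounds and return (pseudo-regret, final policy).

    Assumed setup (not from the paper): rewards are the arm mean plus
    standard Gaussian noise, and the logits start at zero (uniform policy).
    """
    rng = random.Random(seed)
    k = len(means)
    theta = [0.0] * k
    best = max(means)
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        # Sample an arm from the current policy.
        arm = rng.choices(range(k), weights=pi)[0]
        reward = means[arm] + rng.gauss(0.0, 1.0)
        # Score-function gradient estimate: reward * (e_arm - pi).
        for a in range(k):
            indicator = 1.0 if a == arm else 0.0
            theta[a] += eta * reward * (indicator - pi[a])
        # Accumulate the gap of the arm that was played.
        regret += best - means[arm]
    return regret, softmax(theta)
```

Running `policy_gradient_regret(means, eta, n)` for several values of `eta` on an instance with a small gap, and averaging over seeds, is one way to explore empirically how the regret depends on η relative to Δ^2. Any such experiment only illustrates the discrete-time algorithm under these assumptions and does not reproduce the paper's analysis.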
Abstract
We study a continuous-time diffusion approximation of policy gradient for k-armed stochastic bandits. We prove that with a learning rate η = O(Δ^2 / log(n)) the regret is O(k log(k) log(n) / η), where n is the horizon and Δ the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless η = O(Δ^2).
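As a worked consequence of the upper bound, consider setting the learning rate to the largest order the bound permits, η = cΔ^2 / log(n). The constant c > 0 is a hypothetical placeholder introduced here, and this choice assumes Δ is known in advance. Substituting into O(k log(k) log(n) / η) gives:

```latex
\[
\frac{k \log(k) \log(n)}{\eta}
  = \frac{k \log(k) \log(n)}{c\,\Delta^2 / \log(n)}
  = \frac{k \log(k) \log^2(n)}{c\,\Delta^2}.
\]
```

Under that assumption the regret would be O(k log(k) log^2(n) / Δ^2). Smaller learning rates remain admissible but only increase the bound.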
Taxonomy
Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics
