A Diffusion Analysis of Policy Gradient for Stochastic Bandits

Tor Lattimore

arXiv:2603.10219·stat.ML·March 12, 2026

TL;DR

This paper analyzes policy gradient for stochastic bandits through a continuous-time diffusion approximation. It gives a regret bound for sufficiently small learning rates and constructs an instance on which the regret is linear when the learning rate is too large.

Contribution

It studies a continuous-time diffusion approximation of policy gradient for k-armed stochastic bandits and derives a regret bound that depends explicitly on the learning rate η.

Findings

01

With learning rate η = O(Δ^2 / log(n)), the regret is O(k log(k) log(n) / η), where n is the horizon and Δ is the minimum gap.

02

On a constructed instance with only logarithmically many arms, the regret is linear unless η = O(Δ^2).

03

Together, these results show that policy gradient is sensitive to the learning rate: on some instances, η must scale with the squared minimum gap Δ^2 to avoid linear regret.
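To make the first finding concrete, it can help to substitute a learning rate at the upper end of the admissible range into the bound. The constant $c$ below is a hypothetical placeholder, not a value taken from the paper.

```latex
\[
\eta = \frac{c\,\Delta^{2}}{\log(n)}
\quad\Longrightarrow\quad
\frac{k \log(k) \log(n)}{\eta}
= \frac{k \log(k) \log^{2}(n)}{c\,\Delta^{2}} .
\]
```

Under this choice the stated bound reads $O(k \log(k) \log^{2}(n) / \Delta^{2})$. A smaller η still satisfies the condition but makes the bound larger, since η appears in the denominator.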

Abstract

We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^{2} / \log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^{2})$.
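For readers who want to see the object being approximated, here is a minimal sketch of a discrete-time softmax policy gradient update on a $k$-armed bandit. It is an illustration under stated assumptions, not the paper's method: the paper analyzes a continuous-time diffusion, whereas this uses the standard REINFORCE-style update without a baseline, unit-variance Gaussian rewards, and zero initial logits. The function names `softmax` and `policy_gradient_regret` are hypothetical.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a vector of logits."""
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()


def policy_gradient_regret(means, eta, n, seed=0):
    """Run softmax policy gradient for n rounds; return (regret, final policy).

    Assumptions (not from the paper): Gaussian rewards with unit variance,
    logits initialised at zero, no baseline in the gradient estimate.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    theta = np.zeros(k)
    best = means.max()
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        arm = rng.choice(k, p=pi)
        reward = means[arm] + rng.standard_normal()
        # Stochastic gradient of expected reward: reward * (e_arm - pi).
        grad = -pi * reward
        grad[arm] += reward
        theta += eta * grad
        # Accumulate pseudo-regret using the true means.
        regret += best - means[arm]
    return regret, softmax(theta)


if __name__ == "__main__":
    gap = 0.5
    for eta in (0.01, gap**2, 2.0):
        r, pi = policy_gradient_regret([gap, 0.0], eta=eta, n=5000, seed=1)
        print(f"eta={eta:.3f}  regret={r:.1f}  final policy={np.round(pi, 3)}")
```

The loop at the bottom compares a few learning rates around $\Delta^{2}$ on a two-armed instance. The paper's threshold concerns the diffusion limit and a specific constructed instance, so this simulation is only a way to experiment with the dependence on η, not a reproduction of the result.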

Taxonomy

Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics