A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits

Tor Lattimore

arXiv:2603.26547·cs.LG·March 30, 2026

TL;DR

This paper carries the Lyapunov analysis of softmax policy gradient from continuous-time to standard discrete-time $k$-armed stochastic bandits, proving $O(k \log(k) \log(n) / \eta)$ regret when the learning rate satisfies $\eta = O(\Delta_{\min}^{2} / (\Delta_{\max} \log(n)))$.

Contribution

It adapts the Lyapunov analysis of policy gradient methods from continuous to discrete time for stochastic bandits, deriving regret bounds with explicit learning rate dependencies.

Findings

01

Regret is $O(k \log(k) \log(n) / \eta)$, where $k$ is the number of arms, $n$ is the horizon, and $\eta$ is the learning rate.

02

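The bound holds for a learning rate $\eta = O(\Delta_{\min}^{2} / (\Delta_{\max} \log(n)))$, where $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum gaps.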

03

The discrete-time guarantee has the same form as the continuous-time result of Lattimore (2026).
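
The regret bound and the learning rate condition can be combined into a single expression. The display below is a worked substitution, not a statement taken from the paper: it assumes $\eta$ is set at the largest order the condition allows, with a hypothetical constant $c > 0$, and writes $R_n$ for the regret (a symbol introduced here for illustration).

$$
\eta = \frac{c\,\Delta_{\min}^{2}}{\Delta_{\max} \log(n)}
\quad\Longrightarrow\quad
R_n = O\!\left(\frac{k \log(k) \log(n)}{\eta}\right)
    = O\!\left(\frac{k \log(k)\, \Delta_{\max} \log(n)^{2}}{\Delta_{\min}^{2}}\right).
$$

Under that assumption the bound is polylogarithmic in the horizon $n$. A smaller $\eta$ still satisfies the condition but makes the $1/\eta$ factor larger.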

Abstract

We adapt the analysis of policy gradient for continuous time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete time setup. As in continuous time, we prove that with learning rate $\eta = O(\Delta_{\min}^{2} / (\Delta_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / \eta)$ where $n$ is the horizon and $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum gaps.
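
To make the setup concrete, here is a minimal simulation sketch of softmax policy gradient on a $k$-armed bandit. The abstract does not spell out the update rule, so several things here are assumptions rather than the paper's algorithm: the update is the common REINFORCE-style form $\theta_a \leftarrow \theta_a + \eta X_t (1\{A_t = a\} - \pi_t(a))$, rewards are Bernoulli, and the constant `c` in `learning_rate` is hypothetical. The function names are illustrative only.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def learning_rate(gaps, n, c=1.0):
    """eta = c * gap_min^2 / (gap_max * log(n)); c is a hypothetical constant.

    Gaps of zero (the optimal arm) are ignored. Requires n > 1.
    """
    positive = [g for g in gaps if g > 0]
    gap_min, gap_max = min(positive), max(positive)
    return c * gap_min ** 2 / (gap_max * math.log(n))


def run_softmax_pg(means, n, eta, seed=0):
    """Simulate n rounds; return (expected regret, final policy).

    Assumes Bernoulli rewards with the given means and a REINFORCE-style
    update without a baseline. Both are assumptions of this sketch.
    """
    rng = random.Random(seed)
    k = len(means)
    theta = [0.0] * k  # uniform initial policy
    best = max(means)
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        arm = rng.choices(range(k), weights=pi)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # Stochastic gradient step on the logits.
        for a in range(k):
            indicator = 1.0 if a == arm else 0.0
            theta[a] += eta * reward * (indicator - pi[a])
        regret += best - means[arm]
    return regret, softmax(theta)


if __name__ == "__main__":
    means = [0.9, 0.4, 0.7]
    gaps = [max(means) - m for m in means]
    n = 10_000
    eta = learning_rate(gaps, n)
    regret, policy = run_softmax_pg(means, n, eta)
    print(f"eta={eta:.5f} regret={regret:.1f} policy={policy}")
```

The learning rate needs the gaps and the horizon up front, which matches the dependence of $\eta$ on $\Delta_{\min}$, $\Delta_{\max}$, and $n$ in the abstract. A single simulated run says nothing about the proven bound; it only shows the moving parts.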
