A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits
Tor Lattimore

TL;DR
This paper carries the continuous-time Lyapunov analysis of softmax policy gradient over to discrete-time k-armed stochastic bandits, giving a regret bound that holds when the learning rate is small enough relative to the gaps.
Contribution
It adapts the Lyapunov analysis of policy gradient methods from continuous to discrete time for stochastic bandits, deriving regret bounds with explicit learning rate dependencies.
Findings
Regret is bounded by O(k log(k) log(n) / η), where k is the number of arms, n is the horizon, and η is the learning rate.
The bound holds for learning rate η = O(Δ_min^2 / (Δ_max log(n))), where Δ_min and Δ_max are the minimum and maximum gaps.
The discrete-time guarantee has the same form as the continuous-time one.
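To make the dependence on the gaps concrete, the two quantities above can be evaluated numerically. This is a minimal sketch: the function names and the constant `c` are hypothetical, since the summary states both quantities only up to unspecified constants inside the O(·).

```python
import math


def learning_rate(gaps, n, c=1.0):
    """Return c * Δ_min^2 / (Δ_max * log(n)).

    `gaps` holds the suboptimality gap of every arm (0 for the best arm).
    `c` is a hypothetical stand-in for the unspecified constant in the O(.).
    """
    positive = [g for g in gaps if g > 0]
    d_min, d_max = min(positive), max(positive)
    return c * d_min ** 2 / (d_max * math.log(n))


def regret_scale(k, n, eta):
    """Return k * log(k) * log(n) / eta, the bound with its constant dropped."""
    return k * math.log(k) * math.log(n) / eta


gaps = [0.0, 0.5, 1.0]   # illustrative gaps, not taken from the paper
n = 10_000
eta = learning_rate(gaps, n)
print(eta, regret_scale(len(gaps), n, eta))
```

Halving Δ_min shrinks the admissible learning rate by a factor of four, which in turn inflates the regret scale by the same factor.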
Abstract
We adapt the analysis of policy gradient for continuous time k-armed stochastic bandits by Lattimore (2026) to the standard discrete time setup. As in continuous time, we prove that with learning rate η = O(Δ_min^2 / (Δ_max log(n))) the regret is O(k log(k) log(n) / η), where n is the horizon and Δ_min and Δ_max are the minimum and maximum gaps.
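For readers unfamiliar with the algorithm being analysed, the following is a minimal simulation sketch of softmax policy gradient in the discrete-time setup. It assumes the standard REINFORCE-style update without a baseline and unit-variance Gaussian rewards. Neither detail is stated in the abstract, and the function names are hypothetical.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over the logits."""
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()


def run_softmax_pg(means, n, eta, seed=0):
    """Simulate n rounds; return the final logits and the cumulative pseudo-regret."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    theta = np.zeros(k)                   # uniform initial policy
    best = means.max()
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)           # sample an arm from the current policy
        x = rng.normal(means[a], 1.0)     # assumption: unit-variance Gaussian noise
        grad = -x * pi                    # stochastic gradient: x * (e_a - pi)
        grad[a] += x
        theta += eta * grad
        regret += best - means[a]
    return theta, regret


theta, regret = run_softmax_pg([0.9, 0.1], n=2000, eta=0.05)
print(softmax(theta), regret)
```

The gradient estimate sums to zero across arms, so the logits stay centred. The bound in the abstract applies only when `eta` is small relative to the gaps, and the values used here are purely illustrative.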
