The Typical Behavior of Bandit Algorithms
Lin Fan, Peter W. Glynn

TL;DR
This paper establishes strong laws of large numbers and central limit theorems for the regret of Thompson sampling and UCB bandit algorithms, characterizing their typical behavior and fluctuations over time.
Contribution
The paper provides the first SLLN and CLT results for the regret of Thompson sampling and UCB, complementing the earlier characterizations of the tail of the regret distribution (Fan and Glynn, 2021) and revealing the asymptotic variance structure of regret.
Findings
Thompson sampling and UCB satisfy the same SLLN and CLT.
The mean and variance of regret in the CLT both grow logarithmically in the time horizon.
Asymptotically, the variability in the number of plays of each sub-optimal arm depends only on the rewards received for that arm.
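To get a feel for the logarithmic growth described in the findings, one can simulate a simple bandit and look at regret across many independent runs. The sketch below is illustrative only and is not taken from the paper: it assumes Gaussian rewards with unit variance, a textbook UCB1-style index with bonus sqrt(2 log t / n), and hypothetical names (`run_ucb`, `summarize`). The paper's exact algorithm variants and constants may differ.

```python
import math
import random


def run_ucb(means, horizon, seed=0):
    """Run a UCB1-style index policy on Gaussian arms with unit variance.

    Assumption: the index is (sample mean) + sqrt(2 * log(t) / pulls), a
    common textbook choice, not necessarily the variant analysed in the paper.
    Returns (path, pulls), where path[t] is the cumulative pseudo-regret
    after t + 1 plays and pulls[i] is the number of plays of arm i.
    """
    rng = random.Random(seed)
    k = len(means)
    best = max(means)
    pulls = [0] * k
    sums = [0.0] * k
    regret = 0.0
    path = []
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once to initialise the indices
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        reward = rng.gauss(means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # pseudo-regret uses the true gap
        path.append(regret)
    return path, pulls


def summarize(means, horizon, runs=50):
    """Sample mean and variance of final pseudo-regret over independent runs."""
    finals = [run_ucb(means, horizon, seed=s)[0][-1] for s in range(runs)]
    mean = sum(finals) / runs
    var = sum((x - mean) ** 2 for x in finals) / (runs - 1)
    return mean, var


if __name__ == "__main__":
    # If mean and variance both scale like log(T), these ratios should
    # change slowly as T grows (finite-sample noise is expected).
    for T in (500, 2000, 8000):
        m, v = summarize([0.5, 0.0], T)
        print(T, round(m / math.log(T), 3), round(v / math.log(T), 3))
```

Printing the mean and variance divided by log(T) is a rough diagnostic rather than a verification of the theorems: with a small number of runs and moderate horizons, the ratios are noisy and lower-order terms can still be visible.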
Abstract
We establish strong laws of large numbers and central limit theorems for the regret of two of the most popular bandit algorithms: Thompson sampling and UCB. Here, our characterizations of the regret distribution complement the characterizations of the tail of the regret distribution recently developed by Fan and Glynn (2021) (arXiv:2109.13595). The tail characterizations there are associated with atypical bandit behavior on trajectories where the optimal arm mean is under-estimated, leading to mis-identification of the optimal arm and large regret. In contrast, our SLLN's and CLT's here describe the typical behavior and fluctuation of regret on trajectories where the optimal arm mean is properly estimated. We find that Thompson sampling and UCB satisfy the same SLLN and CLT, with the asymptotics of both the SLLN and the (mean) centering sequence in the CLT matching the asymptotics of…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Forecasting Techniques and Applications
