Optimism Stabilizes Thompson Sampling for Adaptive Inference
Shunxing Yan, Han Zhong

TL;DR
This paper demonstrates that optimism in Thompson sampling stabilizes its inferential properties in multi-armed bandits, allowing for valid asymptotic inference with minimal regret increase.
Contribution
It extends the stability results of variance-inflated Thompson sampling from two-armed to K-armed bandits and introduces an alternative optimistic modification that also ensures stability.
Findings
Variance-inflated TS is stable for any number of arms.
Optimistic modifications enable valid asymptotic inference.
Stability is achieved with only mild regret cost.
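To make the findings above concrete, here is a minimal sketch of Thompson sampling with an inflated posterior variance in a Gaussian bandit. It is an illustration under stated assumptions, not the authors' algorithm: rewards are assumed to have unit variance, the prior is assumed flat, and the function name `inflated_thompson_sampling` and the constant `inflation` factor are hypothetical. This summary does not state the inflation schedule the paper analyzes.

```python
import numpy as np


def inflated_thompson_sampling(means, horizon, inflation=1.0, seed=0):
    """Gaussian Thompson sampling with posterior variance scaled by `inflation`.

    Assumptions (not taken from the paper): unit-variance Gaussian rewards and
    a flat prior, so after n pulls an arm's posterior is N(sample_mean, 1/n).
    `inflation` multiplies that variance; inflation=1.0 is plain TS.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = len(means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)

    for t in range(horizon):
        if t < k:
            # Pull every arm once so each posterior is well defined.
            arm = t
        else:
            post_mean = sums / counts
            post_std = np.sqrt(inflation / counts)
            # Draw one posterior sample per arm and play the largest.
            arm = int(np.argmax(rng.normal(post_mean, post_std)))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward

    # Pull counts and per-arm sample means (the quantities inference targets).
    return counts, sums / counts


# Example: two optimal arms, the regime the abstract calls challenging.
counts, estimates = inflated_thompson_sampling(
    [0.5, 0.5, 0.0], horizon=2000, inflation=4.0, seed=1
)
```

Running this over many seeds and comparing the spread of `counts / horizon` for `inflation=1.0` against a larger value is one informal way to see whether pull counts settle around a deterministic scale, which is the stability property the paper studies.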
Abstract
Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the K-armed Gaussian bandit and identify optimism as a key mechanism for restoring stability, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any K, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} through extending their results from the two-armed setting to the general K-armed…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference
