Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors
Junya Honda, Akimichi Takemura

TL;DR
This paper investigates the conditions under which Thompson sampling is asymptotically optimal in Gaussian bandit models, emphasizing the critical role of prior selection, especially in multiparameter settings.
Contribution
It proves that Thompson sampling with the uniform prior achieves the asymptotic regret bound for Gaussian bandits with unknown means and variances, whereas Thompson sampling with the Jeffreys prior or the reference prior does not. The choice of prior therefore matters.
Findings
Uniform prior achieves asymptotic optimality.
Jeffreys and reference priors do not achieve the bound.
Prior selection critically affects Thompson sampling performance.
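For context, the bound in question is the standard asymptotic lower bound on regret for consistent policies, specialized to normal arms with unknown mean and variance. The notation below is an assumption made for illustration and is not taken from this summary: arm $i$ has mean $\mu_i$ and variance $\sigma_i^2$, $\mu^*$ is the largest mean, $\Delta_i = \mu^* - \mu_i$, and $N_i(T)$ counts the pulls of arm $i$ up to round $T$.

```latex
% Minimal KL divergence from N(mu_i, sigma_i^2) to any normal with mean >= mu^*:
% minimizing (1/2)[ log(s/sigma_i^2) + (sigma_i^2 + Delta_i^2)/s - 1 ] over s
% gives s = sigma_i^2 + Delta_i^2, hence
D_{\inf}(\mu_i, \sigma_i^2; \mu^*) = \frac{1}{2}\log\!\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right)

% Lower bound on the expected number of pulls of each suboptimal arm i:
\liminf_{T \to \infty} \frac{\mathbb{E}[N_i(T)]}{\log T}
  \;\ge\; \frac{1}{D_{\inf}(\mu_i, \sigma_i^2; \mu^*)}
  \;=\; \frac{2}{\log\!\left(1 + \Delta_i^2 / \sigma_i^2\right)}

% Equivalent bound on the expected regret:
\liminf_{T \to \infty} \frac{\mathbb{E}[\mathrm{Regret}(T)]}{\log T}
  \;\ge\; \sum_{i:\, \Delta_i > 0} \frac{2\,\Delta_i}{\log\!\left(1 + \Delta_i^2 / \sigma_i^2\right)}
```

A policy is asymptotically optimal in this sense when its regret matches the last expression with equality in the limit.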
Abstract
In stochastic bandit problems, a Bayesian policy called Thompson sampling (TS) has recently attracted much attention for its excellent empirical performance. However, the theoretical analysis of this policy is difficult and its asymptotic optimality is only proved for one-parameter models. In this paper we discuss the optimality of TS for the model of normal distributions with unknown means and variances as one of the most fundamental examples of multiparameter models. First we prove that the expected regret of TS with the uniform prior achieves the theoretical bound, which is the first result to show that the asymptotic bound is achievable for the normal distribution model. Next we prove that TS with Jeffreys prior and reference prior cannot achieve the theoretical bound. Therefore the choice of priors is important for TS and non-informative priors are sometimes risky in cases of…
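The policy described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the prior family π(μ, σ²) ∝ (σ²)^(-1-α), in which α = -1 is the uniform prior, α = 0 the reference prior, and α = 1/2 the Jeffreys prior. Under that family, after n observations with sample mean x̄ and sum of squared deviations S, the posterior of μ is a shifted and scaled Student t distribution with n + 2α - 1 degrees of freedom. The function names, the `alpha` parameter, and the initialization rule (pull each arm until the posterior is proper) are hypothetical choices for this sketch and may differ from the paper's.

```python
import numpy as np


def sample_posterior_mean(rewards, alpha, rng):
    """Draw one sample of mu from its marginal posterior.

    Assumed prior: pi(mu, sigma^2) proportional to (sigma^2)^(-1 - alpha).
    The marginal posterior of mu is then
        mu = xbar + sqrt(S / (n * dof)) * t_dof,   dof = n + 2*alpha - 1.
    """
    x = np.asarray(rewards, dtype=float)
    n = x.size
    dof = n + 2.0 * alpha - 1.0
    if n < 2 or dof <= 0:
        raise ValueError("too few samples for a proper posterior")
    xbar = x.mean()
    s = np.sum((x - xbar) ** 2)
    return xbar + np.sqrt(s / (n * dof)) * rng.standard_t(dof)


def initial_pulls(alpha):
    """Smallest n >= 2 with n + 2*alpha - 1 > 0 (a choice made for this sketch)."""
    return max(2, int(np.floor(1.0 - 2.0 * alpha)) + 1)


def thompson_sampling(true_means, true_stds, horizon, alpha=-1.0, seed=0):
    """Run TS on simulated Gaussian arms; return pull counts and pseudo-regret."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    history = [[] for _ in range(k)]
    n_init = initial_pulls(alpha)
    best = max(true_means)
    regret = 0.0
    for t in range(horizon):
        if t < k * n_init:
            arm = t % k  # round-robin initialization
        else:
            # Sample a plausible mean for every arm and play the largest.
            samples = [sample_posterior_mean(h, alpha, rng) for h in history]
            arm = int(np.argmax(samples))
        history[arm].append(rng.normal(true_means[arm], true_stds[arm]))
        regret += best - true_means[arm]
    counts = np.array([len(h) for h in history])
    return counts, regret


if __name__ == "__main__":
    # Compare the three non-informative priors on one simulated instance.
    for name, a in [("uniform", -1.0), ("reference", 0.0), ("Jeffreys", 0.5)]:
        counts, regret = thompson_sampling([0.0, 1.0], [1.0, 1.0], 2000, alpha=a)
        print(name, counts, round(regret, 1))
```

A single short simulation like this cannot confirm or refute the asymptotic statements in the abstract. It only shows where the prior enters the algorithm: a smaller α gives fewer degrees of freedom, so the posterior of μ has heavier tails and the policy explores more.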
Taxonomy
Topics: Advanced Bandit Algorithms Research · Hemodynamic Monitoring and Therapy
