Generalized Regret Analysis of Thompson Sampling using Fractional Posteriors
Prateek Jaiswal, Debdeep Pati, Anirban Bhattacharya, Bani K. Mallick

TL;DR
This paper introduces a fractional-posterior variant of Thompson sampling, called $\alpha$-TS, and provides new regret bounds for stochastic multi-armed bandits under mild conditions, expanding the theoretical understanding of Bayesian bandit algorithms.
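As a point of reference, the $\alpha$-posterior that gives $\alpha$-TS its name tempers the likelihood by a factor $\alpha$. The notation here is assumed for this sketch and may differ from the paper's: $X_{k,1:n}$ denotes the first $n$ rewards of arm $k$, $p_\theta$ the reward density, $\pi$ the prior, $\mu(\theta)$ the mean reward, and $N_k(t-1)$ the number of pulls of arm $k$ before round $t$.

```latex
\pi_\alpha\left(\theta \mid X_{k,1:n}\right)
  = \frac{\prod_{j=1}^{n} p_\theta(X_{k,j})^{\alpha}\, \pi(\theta)}
         {\int \prod_{j=1}^{n} p_{\vartheta}(X_{k,j})^{\alpha}\, \pi(\vartheta)\, d\vartheta},
  \qquad \alpha \in (0,1),

A_t = \arg\max_{k} \mu(\tilde\theta_k),
  \qquad \tilde\theta_k \sim \pi_\alpha\left(\cdot \mid X_{k,1:N_k(t-1)}\right).
```

Setting $\alpha = 1$ recovers the standard posterior and therefore ordinary TS. A smaller $\alpha$ down-weights the data, which widens the sampling distribution and encourages more exploration.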
Contribution
It develops a generalized regret analysis for $\alpha$-TS using fractional posteriors. The resulting bounds apply to a broad class of reward models and priors and do not require conjugacy.
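Because the analysis does not rely on conjugacy, one generic way to sample an $\alpha$-posterior is to discretize the parameter. The sketch below does this on a fixed 1-D grid for an arbitrary log-prior and log-likelihood. The grid approximation and the helper name `sample_alpha_posterior_grid` are illustrative assumptions; they are not the paper's method.

```python
import numpy as np

def sample_alpha_posterior_grid(rewards, log_lik, log_prior, alpha, grid, rng):
    """Draw one sample from a grid approximation of the alpha-posterior.

    Hypothetical helper: log_lik(x, theta) returns per-observation log-densities
    and log_prior(grid) returns the log-prior on every grid point. No conjugacy
    between prior and likelihood is needed.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Full-data log-likelihood at each grid point (0 when there are no rewards).
    ll = np.array([log_lik(rewards, theta).sum() for theta in grid])
    # Tempering: the likelihood is raised to the power alpha.
    log_post = alpha * ll + log_prior(grid)
    # Normalize in log-space for numerical stability.
    log_post -= log_post.max()
    weights = np.exp(log_post)
    weights /= weights.sum()
    return rng.choice(grid, p=weights)
```

For example, you can pair a unit-variance Gaussian likelihood with a Cauchy prior. That prior is not conjugate, but its density is positive, continuous, and bounded.

```python
rng = np.random.default_rng(0)
grid = np.linspace(-3.0, 3.0, 601)
log_prior = lambda g: -np.log1p(g ** 2)          # Cauchy prior, up to a constant
log_lik = lambda x, th: -0.5 * (x - th) ** 2      # N(theta, 1), up to a constant
theta_draw = sample_alpha_posterior_grid([0.9, 0.7, 0.8], log_lik, log_prior, 0.5, grid, rng)
```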
Findings
Achieves both instance-dependent and instance-independent regret bounds.
Matches, up to constants, the instance-dependent regret bound of improved UCB.
Applicable to sub-Gaussian and exponential family reward models.
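Bernoulli rewards are a simple exponential-family instance. Under a Beta prior, tempering has a closed form: raising $\theta^{s}(1-\theta)^{f}$ to the power $\alpha$ gives $\theta^{\alpha s}(1-\theta)^{\alpha f}$, so the $\alpha$-posterior is $\mathrm{Beta}(a_0 + \alpha s,\ b_0 + \alpha f)$. The Beta prior is chosen here only for convenience, and the function names are hypothetical.

```python
import numpy as np

def alpha_beta_params(successes, failures, alpha, a0=1.0, b0=1.0):
    """Beta parameters of the alpha-posterior for a Bernoulli arm with a Beta(a0, b0) prior."""
    return a0 + alpha * successes, b0 + alpha * failures

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def alpha_ts_bernoulli_step(successes, failures, alpha, rng):
    """Choose an arm by drawing one sample per arm from its alpha-posterior."""
    a, b = alpha_beta_params(np.asarray(successes, dtype=float),
                             np.asarray(failures, dtype=float), alpha)
    return int(np.argmax(rng.beta(a, b)))

# With 6 successes and 4 failures under a uniform prior, alpha = 0.5 gives
# Beta(4, 3). That is wider than the alpha = 1 posterior Beta(7, 5).
print(alpha_beta_params(6, 4, 0.5))  # (4.0, 3.0)
```

The effect of $\alpha < 1$ is the same as observing only an $\alpha$ fraction of the data, which keeps the sampling distribution wider for longer.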
Abstract
Thompson sampling (TS) is one of the most popular and earliest algorithms to solve stochastic multi-armed bandit problems. We consider a variant of TS, named $\alpha$-TS, where we use a fractional or $\alpha$-posterior ($\alpha \in (0,1)$) instead of the standard posterior distribution. To compute an $\alpha$-posterior, the likelihood in the definition of the standard posterior is tempered with a factor $\alpha$. For $\alpha$-TS we obtain both instance-dependent $\mathcal{O}\left(\sum_{k \neq i^*} \Delta_k \left(\frac{\log(T)}{C(\alpha)\Delta_k^2} + \frac{1}{2}\right)\right)$ and instance-independent $\mathcal{O}(\sqrt{KT\log K})$ frequentist regret bounds under very mild conditions on the prior and reward distributions, where $\Delta_k$ is the gap between the true mean rewards of the $k^{th}$ and the best arms, and $C(\alpha)$ is a known constant. Both the sub-Gaussian and exponential…
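Here is a minimal sketch of the $\alpha$-TS loop. It rests on assumptions that are ours, not the paper's:

- Rewards are Gaussian with known variance $\sigma^2$.
- Each arm's mean has an independent Gaussian prior.

These choices are made only because the $\alpha$-posterior then stays Gaussian in closed form. Tempering multiplies each observation's precision $1/\sigma^2$ by $\alpha$. The sketch also accumulates the pseudo-regret $\sum_k \Delta_k N_k(T)$, where $N_k(T)$ counts the pulls of arm $k$ and $\Delta_k$ is its mean-reward gap to the best arm. The function `alpha_ts_gaussian` and its parameters are hypothetical.

```python
import numpy as np

def alpha_ts_gaussian(means, horizon, alpha, sigma=1.0, prior_mean=0.0, prior_var=1.0, seed=0):
    """Run alpha-TS on a simulated Gaussian bandit.

    Assumes rewards ~ N(means[k], sigma^2) with known sigma and an
    independent N(prior_mean, prior_var) prior on each arm's mean.
    Returns (pull counts per arm, pseudo-regret sum_k gap_k * N_k).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    counts = np.zeros(len(means))
    sums = np.zeros(len(means))
    for _ in range(horizon):
        # alpha-posterior: each observation contributes alpha / sigma^2 precision.
        post_prec = 1.0 / prior_var + alpha * counts / sigma ** 2
        post_mean = (prior_mean / prior_var + alpha * sums / sigma ** 2) / post_prec
        # One draw per arm from its alpha-posterior; play the largest.
        theta = rng.normal(post_mean, 1.0 / np.sqrt(post_prec))
        arm = int(np.argmax(theta))
        counts[arm] += 1
        sums[arm] += rng.normal(means[arm], sigma)
    gaps = means.max() - means
    return counts, float(np.dot(gaps, counts))

counts, regret = alpha_ts_gaussian([0.0, 0.5, 1.0], horizon=3000, alpha=0.5, seed=1)
```

Setting `alpha=1.0` in the same function gives standard Gaussian TS, so both variants can be compared on the same simulated instance.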
Taxonomy
Topics: Advanced Bandit Algorithms Research · Sparse and Compressive Sensing Techniques · Machine Learning and Algorithms
Methods: Spatio-temporal stability analysis
