Analysis of Thompson Sampling for the multi-armed bandit problem
Shipra Agrawal, Navin Goyal

TL;DR
This paper gives the first proof that Thompson Sampling achieves logarithmic expected regret for the multi-armed bandit problem, giving theoretical support to the near-optimal performance previously observed only in experiments.
Contribution
The paper gives the first rigorous regret analysis of Thompson Sampling, a Bayesian algorithm dating back to 1933 whose theoretical understanding had been quite limited. It proves logarithmic expected regret bounds for the multi-armed bandit problem.
Findings
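Thompson Sampling achieves logarithmic expected regret in the two-armed bandit problem.
The bounds are optimal in their logarithmic dependence on the time horizon, up to constant factors and the dependence on the gaps between arm means.
The analysis extends to the N-armed bandit problem, where the expected regret is also logarithmic in the number of plays.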
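For reference, "expected regret" in these findings is presumably the standard notion for bandits with fixed arm means. The symbols below (arm means \mu_i, best mean \mu^*, gaps \Delta_i, arm played at time t written i(t), and play counts k_i(T)) are notation assumed here for illustration, not taken from the summary above.

```latex
\mathbb{E}[R(T)]
  = \mathbb{E}\left[\sum_{t=1}^{T} \bigl(\mu^* - \mu_{i(t)}\bigr)\right]
  = \sum_{i} \Delta_i \, \mathbb{E}[k_i(T)],
\qquad \Delta_i = \mu^* - \mu_i .
```

Under this definition, logarithmic expected regret means \mathbb{E}[R(T)] = O(\ln T) for fixed arm means, so each suboptimal arm is played only about \ln T times in expectation.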
Abstract
The multi-armed bandit problem is a popular model for studying the exploration/exploitation trade-off in sequential decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic idea is to choose an arm to play according to its probability of being the best arm. The Thompson Sampling algorithm has experimentally been shown to be close to optimal. In addition, it is efficient to implement and exhibits several desirable properties such as small regret for delayed feedback. However, theoretical understanding of this algorithm was quite limited. In this paper, for the first time, we show that the Thompson Sampling algorithm achieves logarithmic expected regret for the multi-armed bandit problem. More precisely,…
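The idea of playing each arm according to its probability of being the best can be sketched in a few lines. The sketch below assumes Bernoulli (0/1) rewards and independent uniform Beta(1, 1) priors, which is a common instantiation; the abstract above does not specify either, and the function name `thompson_sampling` and its parameters are hypothetical. Drawing one sample from each arm's posterior and playing the arm with the largest sample selects each arm with exactly its posterior probability of being best.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Simulate Thompson Sampling on a Bernoulli bandit (illustrative sketch).

    Assumptions (not from the summary above): rewards are 0/1 and each arm
    starts with a uniform Beta(1, 1) prior on its unknown mean.
    Returns (pulls per arm, cumulative expected regret).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # One posterior sample per arm: Beta(successes + 1, failures + 1).
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Playing the argmax picks each arm with its posterior
        # probability of being the best arm.
        arm = max(range(n_arms), key=lambda i: samples[i])

        # Simulated environment: Bernoulli reward with the arm's true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1

        # Accumulate the gap between the best arm and the arm played.
        regret += best_mean - true_means[arm]

    pulls = [successes[i] + failures[i] for i in range(n_arms)]
    return pulls, regret


if __name__ == "__main__":
    pulls, regret = thompson_sampling([0.1, 0.9], horizon=2000)
    print("pulls per arm:", pulls)
    print("cumulative expected regret:", round(regret, 2))
```

In a run like this, the count for the worse arm should grow slowly with the horizon, which is the behavior a logarithmic regret bound describes. The simulation only illustrates the algorithm; it is not the paper's analysis.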
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Machine Learning and Algorithms
