Batched Thompson Sampling
Cem Kalkanli, Ayfer Ozgur

TL;DR
This paper introduces a batched version of Thompson sampling for multi-armed bandits that achieves near-optimal regret bounds with minimal feedback, matching the performance of algorithms with full feedback.
Contribution
It proposes an adaptive batched Thompson sampling policy that maintains optimal regret bounds and requires only a logarithmic number of batches, without prior knowledge of the time horizon.
Findings
Achieves $O(\log(T))$ problem-dependent regret.
Achieves $O(\sqrt{T\log(T)})$ minimax regret.
Uses $O(\log(\log(T)))$ batches in expectation.
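To give a feel for how differently these three orders grow, the short sketch below evaluates them for a few horizons. It ignores all constants and instance-dependent factors hidden in the $O(\cdot)$ notation, so the numbers are only indicative of scale; the function name `bound_orders` is made up for this illustration.

```python
import math

def bound_orders(T):
    """Evaluate the three growth rates at horizon T (constants ignored)."""
    return {
        "log T": math.log(T),                         # problem-dependent regret order
        "sqrt(T log T)": math.sqrt(T * math.log(T)),  # minimax regret order
        "log log T": math.log(math.log(T)),           # expected number of batches order
    }

# Print the orders for a few horizons, rounded for readability.
for T in (10**3, 10**6, 10**9):
    print(T, {name: round(value, 2) for name, value in bound_orders(T).items()})
```

Even at large horizons the $\log(\log(T))$ term stays in the low single digits, which is the sense in which the policy needs very little feedback.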
Abstract
We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits where the agent observes the rewards of her actions and adjusts her policy only at the end of a small number of batches. We show that this policy simultaneously achieves a problem dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$ while the number of batches can be bounded by $O(\log(T))$ independent of the problem instance over a time horizon $T$. We also show that in expectation the number of batches used by our policy can be bounded by an instance dependent bound of order $O(\log(\log(T)))$. These results indicate that Thompson sampling maintains the same performance in this batched setting as in the case when instantaneous feedback is available after each action, while requiring minimal feedback. These results also indicate that Thompson sampling performs…
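The batched setting described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' policy: the summary does not spell out the batching rule, so the sketch assumes a simple doubling rule (a batch ends once some arm has been pulled within the batch as many times as it had been observed before the batch). It also assumes unit-variance Gaussian rewards with a standard normal prior on each arm mean. The function name `batched_thompson_sampling` and its parameters are hypothetical.

```python
import numpy as np

def batched_thompson_sampling(means, horizon, seed=0):
    """Gaussian Thompson sampling whose posterior is updated only at batch ends.

    Assumed batching rule (illustrative, may differ from the paper): a batch
    ends once the pulls of some arm inside the batch reach
    max(1, its observed pulls before the batch). The horizon is used only to
    stop the simulation, not to size the batches.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    seen_counts = np.zeros(k)   # pulls whose rewards have been revealed
    seen_sums = np.zeros(k)     # sum of revealed rewards per arm
    batch_counts = np.zeros(k)  # pulls in the current batch (not yet revealed)
    batch_sums = np.zeros(k)
    best = max(means)
    regret, batches = 0.0, 0

    for t in range(horizon):
        # Posterior N(sum/(n+1), 1/(n+1)) built from revealed rewards only,
        # so it stays frozen for the whole batch.
        post_mean = seen_sums / (seen_counts + 1.0)
        post_std = 1.0 / np.sqrt(seen_counts + 1.0)
        arm = int(np.argmax(rng.normal(post_mean, post_std)))

        reward = rng.normal(means[arm], 1.0)
        batch_counts[arm] += 1
        batch_sums[arm] += reward
        regret += best - means[arm]

        # End the batch under the assumed doubling rule, or at the last step.
        if batch_counts[arm] >= max(1.0, seen_counts[arm]) or t == horizon - 1:
            seen_counts += batch_counts
            seen_sums += batch_sums
            batch_counts[:] = 0
            batch_sums[:] = 0
            batches += 1

    return regret, batches

regret, batches = batched_thompson_sampling([0.0, 0.5], horizon=2000)
print(f"pseudo-regret: {regret:.1f}, batches used: {batches}")
```

Under this assumed rule, every batch end at least doubles the observed pull count of the arm that triggered it, so the number of batches grows at most logarithmically in the horizon for a fixed number of arms. That mirrors the instance-independent $O(\log(T))$ batch bound, while the sharper $O(\log(\log(T)))$ expected bound is a property of the paper's own analysis and is not demonstrated by this sketch.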
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Optimization and Search Problems
