Parallelizing Thompson Sampling
Amin Karbasi, Vahab Mirrokni, Mohammad Shadravan

TL;DR
This paper introduces a batch Thompson Sampling method that reduces the number of interaction rounds in online decision problems from T to O(log T) while matching the asymptotic regret bound of the fully sequential policy. It does so by dynamically adjusting batch sizes.
Contribution
It proposes a dynamic batch Thompson Sampling framework that needs only O(log T) batch queries over a horizon of T rounds, without sacrificing the asymptotic regret of the fully sequential policy.
Findings
Achieves the same asymptotic regret bound as the fully sequential policy
Reduces the number of interactions from T to O(log T)
Dynamic batch allocation outperforms static batching in experiments
Abstract
How can we make use of information parallelism in online decision making problems while efficiently balancing the exploration-exploitation trade-off? In this paper, we introduce a batch Thompson Sampling framework for two canonical online decision making problems, namely, stochastic multi-arm bandit and linear contextual bandit with finitely many arms. Over a time horizon T, our batch Thompson Sampling policy achieves the same (asymptotic) regret bound of a fully sequential one while carrying out only O(log T) batch queries. To achieve this exponential reduction, i.e., reducing the number of interactions from T to O(log T), our batch policy dynamically determines the duration of each batch in order to balance the exploration-exploitation trade-off. We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as…
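The idea of dynamically chosen batch lengths can be illustrated with a minimal sketch for a Bernoulli multi-arm bandit. The abstract does not spell out the batching rule, so the sketch assumes one: a batch closes whenever some arm's pull count reaches a threshold that then doubles. The function name batch_thompson_sampling, the Beta(1, 1) priors, and the doubling thresholds are all illustrative assumptions, not the authors' exact procedure.

```python
import random


def batch_thompson_sampling(true_means, horizon, seed=0):
    """Bernoulli bandit with Beta(1, 1) priors and an ASSUMED doubling batch rule.

    Returns (number of batches, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    alpha = [1.0] * n_arms      # Beta posterior parameters, updated only at batch ends
    beta = [1.0] * n_arms
    pulls = [0] * n_arms        # total pulls per arm
    threshold = [1] * n_arms    # pull count at which an arm closes the current batch
    pending = []                # (arm, reward) pairs not yet revealed to the learner
    n_batches = 0
    regret = 0.0
    best = max(true_means)

    for t in range(horizon):
        # Sample from the posterior as of the last completed batch.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        pending.append((arm, reward))
        pulls[arm] += 1
        regret += best - true_means[arm]

        hit_threshold = pulls[arm] >= threshold[arm]
        # Close the batch when the pulled arm reaches its threshold
        # (assumed rule) or when the horizon ends.
        if hit_threshold or t == horizon - 1:
            for a, r in pending:
                alpha[a] += r
                beta[a] += 1 - r
            pending.clear()
            if hit_threshold:
                threshold[arm] *= 2
            n_batches += 1

    return n_batches, regret


if __name__ == "__main__":
    n_batches, regret = batch_thompson_sampling([0.3, 0.5, 0.7], horizon=10_000)
    print(n_batches, round(regret, 1))
```

Under this assumed rule, each arm can close at most floor(log2 T) + 1 batches, so with N arms the total number of batches is at most N(floor(log2 T) + 1) + 1, which is logarithmic in T for a fixed number of arms. Within a batch the posterior is frozen, so all of its pulls could be issued in parallel.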
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Model Reduction and Neural Networks
