Thompson Sampling for Combinatorial Semi-Bandits
Siwei Wang, Wei Chen

TL;DR
This paper applies Thompson sampling to combinatorial multi-armed bandits, providing improved regret bounds, analyzing the matroid setting, and demonstrating through experiments that TS outperforms existing algorithms.
Contribution
The paper introduces a refined analysis of Thompson sampling for CMAB, achieving tighter regret bounds, extends results to matroid bandits without independence assumptions, and highlights limitations of using approximation oracles.
Findings
Thompson sampling achieves better regret bounds than prior UCB-based methods.
In the matroid bandit setting, regret bounds match the theoretical lower bounds.
Experiments show Thompson sampling outperforms existing algorithms in practice.
Abstract
In this paper, we study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We first analyze the standard TS algorithm for the general CMAB model when the outcome distributions of all the base arms are independent, and obtain a distribution-dependent regret bound of $O(m \log K_{\max} \log T / \Delta_{\min})$, where $m$ is the number of base arms, $K_{\max}$ is the size of the largest super arm, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and any non-optimal solution. This regret upper bound is better than the bound in prior works. Moreover, our novel analysis techniques can help to tighten the regret bounds of other existing UCB-based policies (e.g., ESCB), as we improve the method of counting the…
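The loop the abstract analyzes (sample a mean estimate for every base arm, hand the samples to an optimization oracle, play the returned super arm, and update only the observed base arms) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Bernoulli outcomes with Beta(1, 1) priors, and it uses a top-k selection problem (a uniform matroid, for which picking the k largest samples is an exact oracle) as a stand-in for a general combinatorial structure. The names `top_k_oracle` and `combinatorial_ts` are hypothetical.

```python
import random


def top_k_oracle(theta, k):
    """Exact oracle for the illustrative top-k problem (an assumption here):
    return the indices of the k base arms with the largest sampled means."""
    return sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k]


def combinatorial_ts(true_means, k, horizon, seed=0):
    """Sketch of Thompson sampling with semi-bandit feedback.

    Assumes independent Bernoulli base arms and Beta(1, 1) priors.
    Returns the posterior parameters and the cumulative expected regret.
    """
    rng = random.Random(seed)
    m = len(true_means)
    a = [1] * m  # Beta posterior: 1 + number of observed successes
    b = [1] * m  # Beta posterior: 1 + number of observed failures
    best = sum(sorted(true_means, reverse=True)[:k])  # optimal expected reward
    regret = 0.0
    for _ in range(horizon):
        # Draw one posterior sample per base arm.
        theta = [rng.betavariate(a[i], b[i]) for i in range(m)]
        # Let the oracle pick the super arm that is best under the samples.
        super_arm = top_k_oracle(theta, k)
        # Semi-bandit feedback: observe every base arm in the played super arm.
        for i in super_arm:
            outcome = 1 if rng.random() < true_means[i] else 0
            a[i] += outcome
            b[i] += 1 - outcome
        regret += best - sum(true_means[i] for i in super_arm)
    return a, b, regret


if __name__ == "__main__":
    a, b, regret = combinatorial_ts([0.9, 0.8, 0.2, 0.1], k=2, horizon=2000)
    print("posterior successes:", a)
    print("posterior failures:", b)
    print("cumulative expected regret:", round(regret, 2))
```

Swapping `top_k_oracle` for another exact solver (for example, a greedy routine over a different matroid) leaves the rest of the loop unchanged, which is why the oracle is kept as a separate function in this sketch.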
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems
Methods: Spatio-temporal stability analysis
