Thompson Sampling For Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox
Raymond Zhang, Richard Combes

TL;DR
This paper introduces a Thompson Sampling algorithm for linear combinatorial semi-bandits whose finite-time regret is polynomial rather than exponential in the problem dimension. It also identifies a paradox: sampling from a well-chosen Gaussian posterior can outperform sampling from the correct posterior.
Contribution
It presents the first known Thompson Sampling method whose finite-time regret does not scale exponentially with the dimension, and it describes the mismatched sampling paradox.
Findings
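The proposed Thompson Sampling variant achieves finite-time regret that is polynomial, not exponential, in the problem dimension.
A learner who samples from the correct posterior can perform exponentially worse than one who samples from a well-chosen Gaussian posterior.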
Code for experiments is publicly available.
Abstract
We consider Thompson Sampling (TS) for linear combinatorial semi-bandits with subgaussian rewards. We propose the first known TS whose finite-time regret does not scale exponentially with the dimension of the problem. We further show the "mismatched sampling paradox": a learner who knows the reward distributions and samples from the correct posterior distribution can perform exponentially worse than a learner who does not know these distributions and simply samples from a well-chosen Gaussian posterior. The code used to generate the experiments is available at https://github.com/RaymZhang/CTS-Mismatched-Paradox
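To make the setting concrete, here is a minimal sketch of Gaussian Thompson Sampling on a combinatorial semi-bandit. It is an illustration under stated assumptions, not the authors' algorithm: the action set (all subsets of m arms out of d), the Bernoulli rewards, the per-arm sampling variance 1/n, and the name gaussian_cts are all hypothetical choices made here. Sampling Gaussian indices for Bernoulli rewards is a mismatched choice in the spirit of the abstract, but the paper's specific "well-chosen" Gaussian posterior is not reproduced.

```python
import numpy as np

def gaussian_cts(means, m, horizon, seed=0):
    """Illustrative Gaussian Thompson Sampling for an m-set semi-bandit.

    means   : true arm means in [0, 1], unknown to the learner
    m       : number of arms played per round (assumed action structure)
    horizon : number of rounds
    Returns (cumulative pseudo-regret, per-arm play counts).
    """
    rng = np.random.default_rng(seed)
    d = len(means)
    counts = np.zeros(d)   # times each arm was observed
    sums = np.zeros(d)     # sum of observed rewards per arm
    best = np.sort(means)[-m:].sum()  # value of the optimal m-set
    regret = 0.0
    for _ in range(horizon):
        # Draw one Gaussian index per arm around its empirical mean.
        # Variance 1/n is an assumption of this sketch; unseen arms
        # are treated as if n = 1 with empirical mean 0.
        n = np.maximum(counts, 1.0)
        theta = rng.normal(sums / n, 1.0 / np.sqrt(n))
        # Linear maximization over m-sets: take the m largest indices.
        action = np.argpartition(theta, -m)[-m:]
        # Semi-bandit feedback: one reward is observed per played arm.
        rewards = rng.binomial(1, means[action])
        counts[action] += 1
        sums[action] += rewards
        regret += best - means[action].sum()
    return regret, counts
```

For example, gaussian_cts(np.array([0.9, 0.8, 0.2, 0.1]), m=2, horizon=500) returns the cumulative pseudo-regret together with how often each arm was played. The experiments behind the paper's claims are in the authors' repository linked above.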
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques
Methods: Spatio-temporal stability analysis
