Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning
Tong Zhang

TL;DR
This paper introduces Feel-Good Thompson Sampling, a modification of Thompson Sampling for contextual bandits that favors high-reward models more aggressively to improve exploration. Its regret bounds match minimax lower bounds, and the analysis extends to linear and some MDP problems.
Contribution
The paper proposes Feel-Good Thompson Sampling together with a theoretical framework whose regret bounds match minimax lower bounds; the framework extends to linear and some MDP settings.
Findings
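Standard Thompson Sampling does not explore new actions aggressively enough; Feel-Good Thompson Sampling improves exploration in contextual bandits by favoring high-reward models.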
Theoretical regret bounds match minimax lower bounds.
Framework extends to linear and certain MDP problems.
Abstract
Thompson Sampling has been widely used for contextual bandit problems due to the flexibility of its modeling power. However, a general theory for this class of methods in the frequentist setting is still lacking. In this paper, we present a theoretical analysis of Thompson Sampling, with a focus on frequentist regret bounds. In this setting, we show that the standard Thompson Sampling is not aggressive enough in exploring new actions, leading to suboptimality in some pessimistic situations. A simple modification called Feel-Good Thompson Sampling, which favors high reward models more aggressively than the standard Thompson Sampling, is proposed to remedy this problem. We show that the theoretical framework can be used to derive Bayesian regret bounds for standard Thompson Sampling, and frequentist regret bounds for Feel-Good Thompson Sampling. It is shown that in both cases, we can…
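The modification described in the abstract can be illustrated with a small sketch. The excerpt above does not give the exact form of the posterior, so everything here is an assumption for illustration: a finite model class with a uniform prior, a squared-error fit term, and a capped bonus on each model's best predicted reward. The names `eta`, `lam`, `cap`, `fgts_log_weights`, and `fgts_choose` are hypothetical. Setting `lam=0` removes the bonus and leaves a standard posterior-sampling rule.

```python
import numpy as np

def fgts_log_weights(models, history, eta=1.0, lam=0.1, cap=1.0):
    """Unnormalized log-posterior over a finite model class (uniform prior).

    models:  list of callables f(x) -> array of predicted rewards, one per action
    history: list of (x, a, r) tuples observed so far
    eta, lam, cap: illustrative hyperparameters (assumed, not from the excerpt)
    """
    logw = np.zeros(len(models))
    for i, f in enumerate(models):
        for x, a, r in history:
            pred = f(x)
            # Fit term: penalize squared error on the action actually played.
            logw[i] -= eta * (pred[a] - r) ** 2
            # "Feel-good" bonus: favor models that predict a high best-case reward.
            logw[i] += lam * min(cap, float(np.max(pred)))
    return logw

def fgts_choose(models, history, x, rng, **kw):
    """Sample one model from the posterior and act greedily under it."""
    logw = fgts_log_weights(models, history, **kw)
    p = np.exp(logw - logw.max())  # subtract the max for numerical stability
    p /= p.sum()
    i = rng.choice(len(models), p=p)
    return int(np.argmax(models[i](x))), p

# Two models that fit the single observation equally well
# but disagree about the best achievable reward.
models = [lambda x: np.array([1.0, 0.0]), lambda x: np.array([0.0, 0.5])]
history = [(None, 0, 0.5)]
action, probs = fgts_choose(models, history, None, np.random.default_rng(0), lam=0.5)
```

In this toy setup both models have the same squared error, so with `lam=0` they receive equal posterior weight; with `lam > 0` the model that predicts the higher best-case reward is sampled more often, which is the sense in which the rule favors high-reward models.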
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Smart Grid Energy Management
