Neural Thompson Sampling
Weitong Zhang, Dongruo Zhou, Lihong Li, Quanquan Gu

TL;DR
Neural Thompson Sampling is a neural network-based Thompson Sampling algorithm for contextual bandits. It comes with a theoretical regret guarantee and is compared experimentally against benchmark bandit algorithms on several datasets.
Contribution
It develops a novel neural network-based posterior distribution for Thompson Sampling, combining deep learning with regret guarantees in contextual bandit problems.
Findings
Achieves a cumulative regret of O(T^{1/2}) in the total number of rounds T, provided the underlying reward function is bounded.
In experiments on multiple datasets, the algorithm outperforms benchmark bandit algorithms.
Theoretical analysis confirms the regret bound for the proposed method.
Abstract
Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of O(T^{1/2}), which matches the regret of other contextual bandit algorithms in terms of total round number T. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.
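The posterior described in the abstract can be illustrated with a minimal sketch. Everything beyond "mean = network output, variance = built from gradient (neural tangent) features" is an assumption made here for illustration: the two-layer ReLU network, the design matrix initialized to lam * I and kept only as its diagonal, the exploration scale nu, and all function names are hypothetical. The step that retrains the network on observed rewards is omitted.

```python
import math
import random


def init_params(d, m, rng):
    """Random two-layer ReLU network f(x) = sqrt(m) * w2 . relu(W1 x) (assumed architecture)."""
    W1 = [[rng.gauss(0.0, math.sqrt(2.0 / m)) for _ in range(d)] for _ in range(m)]
    w2 = [rng.gauss(0.0, math.sqrt(1.0 / m)) for _ in range(m)]
    return W1, w2


def forward_and_grad(x, W1, w2):
    """Return the network output and its gradient w.r.t. all parameters, flattened."""
    m = len(w2)
    scale = math.sqrt(m)
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    hidden = [max(p, 0.0) for p in pre]
    f = scale * sum(a * h for a, h in zip(w2, hidden))
    grad = []
    for j in range(m):
        active = 1.0 if pre[j] > 0 else 0.0
        # d f / d W1[j][i] = sqrt(m) * w2[j] * 1[pre_j > 0] * x[i]
        grad.extend(scale * w2[j] * active * xi for xi in x)
    # d f / d w2[j] = sqrt(m) * relu(pre_j)
    grad.extend(scale * h for h in hidden)
    return f, grad


def posterior_variance(g, U_diag, lam, m):
    """sigma^2 = lam * g^T U^{-1} g / m, with U approximated by its diagonal (a simplification)."""
    return lam * sum(gi * gi / ui for gi, ui in zip(g, U_diag)) / m


def update_design(U_diag, g, m):
    """Diagonal of U + g g^T / m after playing the arm whose gradient is g."""
    return [u + gi * gi / m for u, gi in zip(U_diag, g)]


def select_arm(contexts, W1, w2, U_diag, lam, nu, rng):
    """Sample one reward per arm from N(f(x), nu^2 * sigma^2) and play the largest sample."""
    m = len(w2)
    best_arm, best_sample, best_grad = None, -math.inf, None
    for k, x in enumerate(contexts):
        f, g = forward_and_grad(x, W1, w2)
        sigma = math.sqrt(posterior_variance(g, U_diag, lam, m))
        sample = rng.gauss(f, nu * sigma)
        if sample > best_sample:
            best_arm, best_sample, best_grad = k, sample, g
    return best_arm, best_grad


if __name__ == "__main__":
    rng = random.Random(0)
    d, m, lam, nu = 3, 8, 1.0, 0.1
    W1, w2 = init_params(d, m, rng)
    U_diag = [lam] * (m * d + m)
    contexts = [[0.5, -0.2, 0.8], [0.1, 0.1, 0.1]]
    arm, g = select_arm(contexts, W1, w2, U_diag, lam, nu, rng)
    U_diag = update_design(U_diag, g, m)
    print("played arm", arm)
```

In this sketch, arms whose gradient features have rarely been seen keep a larger variance and are therefore sampled more widely, which is how a single posterior draw handles both exploration and exploitation. A full inverse of the design matrix could replace the diagonal approximation at higher computational cost.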
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
