Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits
Hongju Park, Mohamad Kazem Shirani Faradonbeh

TL;DR
This paper introduces a Thompson Sampling algorithm tailored for partially observable contextual multi-armed bandits, providing theoretical guarantees on regret and learning rates, with empirical validation.
Contribution
It develops a novel Thompson Sampling approach for partially observable contexts and proves regret bounds and learning rates, extending existing methods to more realistic scenarios.
Findings
Regret scales logarithmically with time and with the number of arms.
Regret scales linearly with the dimension of the context.
Numerical analyses support theoretical results.
Abstract
Contextual multi-armed bandits are classical models in reinforcement learning for sequential decision-making associated with individual information. A widely used policy for bandits is Thompson Sampling, where samples from a data-driven probabilistic belief about unknown parameters are used to select the control actions. For this computationally fast algorithm, performance analyses are available under full context observations. However, little is known about problems in which contexts are not fully observed. We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits, and establish theoretical performance guarantees. Technically, we show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension. Further, we establish rates of learning unknown parameters, and provide illustrative…
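To make the setting in the abstract concrete, the following is a minimal simulation sketch of Thompson Sampling when only noisy observations of the contexts are available. It is not the authors' algorithm as stated in the paper. It assumes standard Gaussian contexts, an identity sensing matrix (observation = context + Gaussian noise), a single reward parameter shared by all arms, and a Gaussian posterior with unit prior precision. The function name `thompson_sampling_partial_obs` and all of its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def thompson_sampling_partial_obs(T=2000, n_arms=5, d=3,
                                  obs_noise=0.5, reward_noise=0.5, seed=0):
    """Illustrative Thompson Sampling with noisy context observations.

    Assumptions (not taken from the paper): contexts x ~ N(0, I), observations
    y = x + obs_noise * N(0, I), rewards r = x @ mu + reward_noise * N(0, 1),
    and one parameter mu shared by all arms.
    """
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=d)  # hypothetical ground-truth reward parameter

    # Under the Gaussian assumptions above, E[x | y] = y / (1 + obs_noise^2),
    # so the expected reward is linear in y with parameter eta = shrink * mu.
    shrink = 1.0 / (1.0 + obs_noise ** 2)

    B = np.eye(d)        # posterior precision (unit prior precision)
    b = np.zeros(d)      # running sum of reward-weighted observations
    regret = np.zeros(T)

    for t in range(T):
        x = rng.normal(size=(n_arms, d))                  # hidden contexts
        y = x + obs_noise * rng.normal(size=(n_arms, d))  # what the policy sees

        cov = np.linalg.inv(B)
        cov = (cov + cov.T) / 2.0  # symmetrize against round-off
        eta_sample = rng.multivariate_normal(cov @ b, cov)  # posterior sample

        arm = int(np.argmax(y @ eta_sample))  # act greedily on the sample
        reward = x[arm] @ mu + reward_noise * rng.normal()

        # Regress rewards on the observation of the chosen arm.
        B += np.outer(y[arm], y[arm])
        b += reward * y[arm]

        # Regret relative to the best arm given the observations only.
        expected = shrink * (y @ mu)
        regret[t] = expected.max() - expected[arm]

    eta_hat = np.linalg.solve(B, b)
    return np.cumsum(regret), eta_hat, shrink * mu


if __name__ == "__main__":
    cum_regret, eta_hat, eta_true = thompson_sampling_partial_obs()
    print("final cumulative regret:", cum_regret[-1])
    print("estimation error:", np.linalg.norm(eta_hat - eta_true))
```

In this sketch the policy never sees the true contexts: it samples a parameter from its posterior, scores each arm with the noisy observation, and updates its belief by regressing rewards on the observation of the chosen arm. Regret is measured against the best choice available from the observations, since the hidden contexts cannot be acted on directly.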
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Machine Learning and Algorithms
