Double Thompson Sampling in Finite stochastic Games
Shuqing Shi, Xiaobin Wang, Zhiyou Yang, Fan Zhang, Hong Qu

TL;DR
This paper introduces a double Thompson sampling algorithm for finite discounted Markov Decision Processes, achieving the best known regret bounds and effectively balancing exploration and exploitation.
Contribution
The paper proposes a novel double Thompson sampling reinforcement learning algorithm with superior regret bounds for finite stochastic games.
Findings
Achieves a total regret bound of D√(SAT) over time horizon T, with S states, A actions, and diameter D.
Establishes a regret bound of √T/S^2, described as the best known for finite discounted Markov Decision Processes.
Numerical results demonstrate the algorithm's efficiency and superiority.
Abstract
We consider the trade-off between exploration and exploitation in a finite discounted Markov Decision Process where the state transition matrix of the underlying environment is unknown. We propose a double Thompson sampling reinforcement learning algorithm (DTS) to solve this kind of problem. The algorithm achieves a total regret bound of D√(SAT) in time horizon T with S states, A actions and diameter D. DTS consists of two parts. The first is the traditional part, in which we apply posterior sampling to the transition matrix based on a prior distribution. In the second part, we employ a count-based posterior update method to balance between the locally optimal action and the long-term optimal action in order to find the globally optimal game value. We established a regret bound of √T/S^2, which is by far…
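To make the first part of DTS more concrete, the following is a minimal sketch of generic posterior sampling over a transition matrix, not the authors' implementation. It assumes a Dirichlet prior over next-state distributions and a known reward table R, neither of which is stated in the abstract, and all function names are hypothetical. The count-based second part of DTS is not shown.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_transition_model(counts, rng, prior=1.0):
    """Sample P[s][a] from an assumed Dirichlet posterior.

    counts[s][a][s2] is the number of observed transitions s --a--> s2;
    `prior` is the symmetric Dirichlet pseudo-count (an assumption).
    """
    return [[sample_dirichlet([prior + c for c in row], rng) for row in counts_s]
            for counts_s in counts]

def greedy_policy(P, R, gamma=0.9, iters=200):
    """Run value iteration on the sampled model; return (policy, values)."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(iters):
        # Q[s][a] = immediate reward + discounted expected next-state value
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
        V = [max(q) for q in Q]
    policy = [max(range(n_actions), key=lambda a: Q[s][a])
              for s in range(n_states)]
    return policy, V

# Toy usage: 2 states, 2 actions, no transitions observed yet.
rng = random.Random(0)
counts = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
R = [[0.0, 1.0], [0.0, 1.0]]  # hypothetical rewards: action 1 always pays 1
P = sample_transition_model(counts, rng)
policy, V = greedy_policy(P, R)
```

In an actual learning loop, the agent would act with the returned policy, increment `counts` with each observed transition, and resample the model periodically; how DTS schedules these steps is not specified in the abstract.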
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization
