Double Thompson Sampling in Finite stochastic Games

Shuqing Shi; Xiaobin Wang; Zhiyou Yang; Fan Zhang; Hong Qu

arXiv:2202.10008·cs.LG·March 1, 2022

TL;DR

This paper introduces a double Thompson sampling (DTS) algorithm for finite discounted Markov decision processes with unknown transition dynamics. The algorithm balances exploration and exploitation, and the authors report the best regret bound known to them for this setting.

Contribution

The paper proposes a novel double Thompson sampling reinforcement learning algorithm with superior regret bounds for finite stochastic games.

Findings

01

Achieves a total regret bound of $\tilde{O}(D\sqrt{SAT})$ over time horizon $T$, with $S$ states, $A$ actions, and diameter $D$.

02

Establishes a regret bound of $\tilde{O}(\sqrt{T}/S^{2})$, which the authors describe as the best for this setting.

03

Numerical results demonstrate the algorithm's efficiency and superiority.

Abstract

We consider the trade-off between exploration and exploitation in a finite discounted Markov decision process where the state transition matrix of the underlying environment is unknown. We propose a double Thompson sampling (DTS) reinforcement learning algorithm to solve this kind of problem. The algorithm achieves a total regret bound of $\tilde{O}(D\sqrt{SAT})$ in time horizon $T$ with $S$ states, $A$ actions, and diameter $D$. DTS consists of two parts. The first is the traditional part, in which we apply posterior sampling to the transition matrix based on a prior distribution. In the second, we employ a count-based posterior update method to balance the locally optimal action against the long-term optimal action in order to find the globally optimal game value. We establish a regret bound of $\tilde{O}(\sqrt{T}/S^{2})$, which is by far…
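
The first part of DTS, posterior sampling over the transition matrix, can be illustrated with a generic sketch. The code below is not the authors' algorithm: it is a plain posterior-sampling loop with a Dirichlet prior, and it omits the count-based second part entirely. The toy 2-state, 2-action environment, the known reward table `R`, the uniform prior, and all function names are assumptions made for illustration. It also resamples the model at every step for simplicity.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_transition_model(counts, rng, prior=1.0):
    """Sample P[s][a][s'] from the Dirichlet posterior given observed counts[s][a][s']."""
    return [[sample_dirichlet([prior + c for c in row], rng) for row in per_state]
            for per_state in counts]


def greedy_policy(P, R, gamma, sweeps=200):
    """Run value iteration on model P with rewards R; return (policy, values)."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states

    def q(s, a):
        # One-step lookahead under the current value estimate V.
        return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))

    for _ in range(sweeps):
        V = [max(q(s, a) for a in range(n_actions)) for s in range(n_states)]
    policy = [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]
    return policy, V


def run_demo(steps=500, seed=0):
    """Posterior-sampling loop on a hypothetical toy MDP; returns transition counts."""
    rng = random.Random(seed)
    # Hypothetical true dynamics, hidden from the learner: true_P[s][a][s'].
    true_P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
    # Assumed known rewards: acting in state 1 pays 1, acting in state 0 pays 0.
    R = [[0.0, 0.0], [1.0, 1.0]]
    counts = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
    state = 0
    for _ in range(steps):
        P = sample_transition_model(counts, rng)      # sample a plausible model
        policy, _ = greedy_policy(P, R, gamma=0.9, sweeps=50)
        action = policy[state]                        # act greedily for the sample
        next_state = rng.choices([0, 1], weights=true_P[state][action])[0]
        counts[state][action][next_state] += 1        # posterior update via counts
        state = next_state
    return counts


if __name__ == "__main__":
    print(run_demo())
```

Because each step acts greedily with respect to a freshly sampled model, state-action pairs with few observations have wide posteriors and are occasionally sampled optimistically. That is how exploration arises without an explicit bonus term.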

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization