Gap-Dependent Bounds for Two-Player Markov Games

Zehao Dou; Zhuoran Yang; Zhaoran Wang; Simon S. Du

arXiv:2107.00685·cs.LG·July 5, 2021

Open Access

TL;DR

This paper establishes the first gap-dependent logarithmic regret bounds for Nash Q-learning in two-player turn-based stochastic Markov games, covering both the episodic and the discounted infinite-horizon settings.

Contribution

The paper introduces the first gap-dependent logarithmic regret upper bounds for Nash Q-learning in two-player turn-based stochastic Markov games (2-TBSG), with extensions to the discounted infinite-horizon and linear MDP settings.

Findings

01 Logarithmic regret bounds in the episodic tabular setting

02 Extension of the bounds to the discounted infinite-horizon setting

03 Logarithmic regret under the linear MDP assumption

Abstract

As one of the most popular methods in reinforcement learning, Q-learning has received increasing attention. Recently, a growing body of theoretical work has studied regret bounds for Q-learning-type algorithms in different settings. In this paper, we analyze the cumulative regret of the Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG), and propose the very first gap-dependent logarithmic upper bounds in the episodic tabular setting. This bound matches the theoretical lower bound up to a logarithmic term. Furthermore, we extend the result to the discounted game setting with infinite horizon and propose a similar gap-dependent logarithmic regret bound. Finally, under the linear MDP assumption, we obtain another logarithmic regret bound for 2-TBSG, in both the centralized and independent settings.
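To make the term "gap-dependent" concrete, here is a minimal sketch of how a suboptimality gap could be computed in a tiny turn-based game. It is not the paper's algorithm: the toy game, the deterministic transitions, and the names `solve` and `gaps` are all illustrative assumptions, and the gap definition used here, |V*(s) - Q*(s, a)| at each step, is a common convention assumed for illustration rather than quoted from the abstract.

```python
# Hypothetical toy 2-TBSG; every name and number here is an illustrative assumption.
H = 2                        # horizon (steps per episode)
STATES = [0, 1]
ACTIONS = [0, 1]
OWNER = {0: "max", 1: "min"}  # which player moves in each state (turn-based)

# REWARD[s][a] is the reward paid to the max player.
# NEXT[s][a] is a deterministic next state (a simplification of stochastic transitions).
REWARD = {0: {0: 1.0, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}
NEXT = {0: {0: 1, 1: 0}, 1: {0: 0, 1: 1}}


def solve(horizon):
    """Backward induction: the state's owner maximizes or minimizes Q at each step."""
    V = [{s: 0.0 for s in STATES} for _ in range(horizon + 1)]
    Q = [{s: {} for s in STATES} for _ in range(horizon)]
    for h in reversed(range(horizon)):
        for s in STATES:
            for a in ACTIONS:
                Q[h][s][a] = REWARD[s][a] + V[h + 1][NEXT[s][a]]
            pick = max if OWNER[s] == "max" else min
            V[h][s] = pick(Q[h][s].values())
    return Q, V


def gaps(Q, V):
    """Assumed gap definition: |V*_h(s) - Q*_h(s, a)|, zero for equilibrium actions."""
    return [
        {s: {a: abs(V[h][s] - Q[h][s][a]) for a in ACTIONS} for s in STATES}
        for h in range(len(Q))
    ]


Q, V = solve(H)
GAPS = gaps(Q, V)

# The smallest strictly positive gap is the quantity that gap-dependent
# bounds typically scale with (inversely).
positive = [g for layer in GAPS for s in layer for g in layer[s].values() if g > 0]
min_gap = min(positive)
print("minimum positive gap:", min_gap)
```

In this sketch, a larger minimum gap means suboptimal actions are easier to distinguish from equilibrium actions, which is the intuition behind regret that grows only logarithmically in the number of steps.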

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems

Methods: Q-Learning