Gap-Dependent Bounds for Two-Player Markov Games
Zehao Dou, Zhuoran Yang, Zhaoran Wang, Simon S. Du

TL;DR
This paper establishes the first gap-dependent logarithmic regret bounds for Nash Q-learning in two-player turn-based stochastic Markov games (2-TBSG), covering the episodic tabular, discounted infinite-horizon, and linear MDP settings.
Contribution
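The paper gives the first gap-dependent logarithmic upper bounds on the cumulative regret of Nash Q-learning in two-player turn-based stochastic Markov games (2-TBSG) in the episodic tabular setting, matching the lower bound up to a logarithmic term, and extends them to the discounted infinite-horizon setting and to linear MDPs.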
Findings
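Gap-dependent logarithmic regret bound in the episodic tabular setting, matching the lower bound up to a logarithmic term
Extension of the bound to the discounted infinite-horizon setting
Logarithmic regret under the linear MDP assumption, in both centralized and independent settings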
Abstract
As one of the most popular methods in the field of reinforcement learning, Q-learning has received increasing attention. Recently, there have been more theoretical works on the regret bounds of algorithms in the Q-learning class under different settings. In this paper, we analyze the cumulative regret of running the Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG), and propose the very first gap-dependent logarithmic upper bound in the episodic tabular setting. This bound matches the theoretical lower bound up to a logarithmic term. Furthermore, we extend the conclusion to the discounted game setting with infinite horizon and propose a similar gap-dependent logarithmic regret bound. Also, under the linear MDP assumption, we obtain another logarithmic regret bound for 2-TBSG, in both centralized and independent settings.
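To make the term "gap-dependent" concrete, the following is a minimal Python sketch of how sub-optimality gaps could be computed in a turn-based game. It assumes the usual definition from gap-dependent analyses, adapted to two players: at each state only one player moves, the state value is the max (or min) of the optimal Q-values depending on who moves, and the gap of an action is its distance from that value. The paper's exact definition may differ. The Q-values, state names, and the helper names `state_value`, `suboptimality_gaps`, and `min_positive_gap` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def state_value(q_row: List[float], max_player_moves: bool) -> float:
    """Value of a state: max over actions if the max-player moves, else min."""
    return max(q_row) if max_player_moves else min(q_row)


def suboptimality_gaps(
    q_star: Dict[str, List[float]],
    max_player_moves: Dict[str, bool],
) -> Dict[str, List[float]]:
    """Gap of each (state, action): |V*(s) - Q*(s, a)| (assumed definition)."""
    gaps = {}
    for state, q_row in q_star.items():
        v = state_value(q_row, max_player_moves[state])
        gaps[state] = [abs(v - q) for q in q_row]
    return gaps


def min_positive_gap(gaps: Dict[str, List[float]]) -> float:
    """Smallest strictly positive gap over all state-action pairs."""
    return min(g for row in gaps.values() for g in row if g > 0)


# Hypothetical optimal Q-values for a two-state turn-based game.
q_star = {"s0": [1.0, 0.5, 0.75], "s1": [0.25, 0.5]}
# The max-player moves at s0; the min-player moves at s1.
max_player_moves = {"s0": True, "s1": False}

gaps = suboptimality_gaps(q_star, max_player_moves)
gap_min = min_positive_gap(gaps)
print(gaps)     # {'s0': [0.0, 0.5, 0.25], 's1': [0.0, 0.25]}
print(gap_min)  # 0.25
```

Under this assumed definition, a larger minimum positive gap means suboptimal actions are easier to tell apart from optimal ones, which is the quantity a gap-dependent regret bound scales with.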
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems
Methods: Q-Learning
