Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games
Songtao Feng, Ming Yin, Yu-Xiang Wang, Jing Yang, Yingbin Liang

TL;DR
This paper introduces a model-free Q-learning algorithm for zero-sum Markov games that achieves the same optimal sample complexity as model-based methods, significantly improving sample efficiency in multi-agent reinforcement learning.
Contribution
It presents the first model-free algorithm matching the optimal sample complexity of model-based algorithms for zero-sum Markov games, using a novel value function update technique.
Findings
Achieves optimal $H$-dependence in sample complexity.
Uses variance reduction with reference-advantage decomposition.
Introduces optimistic and pessimistic value functions for better efficiency.
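The variance-reduction idea in the findings above can be illustrated with a toy sketch. It is not the paper's algorithm: the function names, the toy values, and the split into "all" versus "recent" samples are assumptions made for illustration. The point is only that the expected next-state value is split into a reference part, averaged over many samples, and a small advantage part, averaged over a few recent ones.

```python
def mean(xs):
    """Average of a non-empty iterable of numbers."""
    xs = list(xs)
    return sum(xs) / len(xs)


def plain_estimate(value, recent_next_states):
    """Estimate E[V(s')] from the few recent samples only."""
    return mean(value[s] for s in recent_next_states)


def reference_advantage_estimate(value, reference, all_next_states, recent_next_states):
    """Estimate E[V(s')] as E[V_ref(s')] + E[(V - V_ref)(s')].

    The reference term uses every sample collected so far; the advantage
    term uses only recent samples, but V - V_ref is small when the
    reference is accurate, so its contribution to the variance is small.
    """
    reference_part = mean(reference[s] for s in all_next_states)
    advantage_part = mean(value[s] - reference[s] for s in recent_next_states)
    return reference_part + advantage_part


# Hypothetical toy numbers: three next states, reference equal to the current value.
value = [1.0, 2.0, 3.0]
reference = [1.0, 2.0, 3.0]
all_next_states = [0, 1, 2, 0, 1, 2]  # many samples, uniform over states
recent_next_states = [2]              # a single recent sample

plain = plain_estimate(value, recent_next_states)
decomposed = reference_advantage_estimate(
    value, reference, all_next_states, recent_next_states
)
```

In this toy case the true mean under the uniform distribution is 2.0. The plain estimate depends entirely on the single recent sample, while the decomposed estimate recovers the mean from the larger sample set because the advantage term vanishes.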
Abstract
The problem of two-player zero-sum Markov games has recently attracted increasing interest in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash Equilibrium (NE) with the sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the number of actions of the two players, respectively). However, none of the existing model-free algorithms can achieve such optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence for the first time demonstrate that model-free algorithms can enjoy the same optimality in…
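For reference, the usual meaning of an $\epsilon$-optimal NE in this setting can be written as a bound on the duality gap. The notation below is the standard one from the Markov games literature and is assumed here, not taken from this page: $\mu$ and $\nu$ are the max- and min-player policies, $\dagger$ denotes a best response, and $s_1$ is the initial state.

$$
V_1^{\dagger,\nu}(s_1) - V_1^{\mu,\dagger}(s_1) \le \epsilon
$$

Under this reading, the sample complexity counts the episodes needed before the returned policy pair $(\mu, \nu)$ satisfies the inequality.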
Taxonomy
Topics: Reinforcement Learning in Robotics · Game Theory and Applications
Methods: Q-Learning
