Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games
Weichao Mao, Tamer Başar

TL;DR
This paper introduces a decentralized reinforcement learning algorithm for general-sum Markov games that finds an approximate coarse correlated equilibrium (CCE) with a proven sample complexity bound.
Contribution
It presents the first sample complexity result for decentralized learning in general-sum Markov games, obtained by combining optimistic V-learning for exploration with a stabilized online mirror descent (OMD) subroutine for policy updates.
Findings
Achieves an $\tilde{O}(H^6 S A / \epsilon^2)$ sample complexity for $\epsilon$-approximate CCE.
The algorithm is fully decentralized, scalable, and requires only local information.
Introduces a new high-probability regret bound for online mirror descent with dynamic learning rates.
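To make the policy-update idea concrete, here is a minimal sketch of online mirror descent over the probability simplex with a negative-entropy regularizer (an exponential-weights update) driven by bandit feedback and a step-dependent learning rate. It is not the paper's stabilized OMD subroutine: the function name `omd_step`, the learning-rate schedule, and the loss values are all assumptions chosen for illustration.

```python
import numpy as np

def omd_step(policy, action, loss, eta):
    """One entropy-regularized OMD (exponential-weights) step on bandit feedback.

    policy: current distribution over actions (1-D float array summing to 1)
    action: index of the action that was played
    loss:   observed loss in [0, 1] for that action
    eta:    learning rate used for this step
    """
    # Importance-weighted estimate of the full loss vector (only one entry observed).
    loss_hat = np.zeros_like(policy)
    loss_hat[action] = loss / policy[action]
    # Mirror step under negative entropy reduces to a multiplicative update.
    weights = policy * np.exp(-eta * loss_hat)
    return weights / weights.sum()

rng = np.random.default_rng(0)
num_actions = 3
true_losses = np.array([0.9, 0.1, 0.5])  # illustrative values, not from the paper
policy = np.full(num_actions, 1.0 / num_actions)

for t in range(1, 2001):
    # Assumed decaying schedule; the paper's dynamic learning rate may differ.
    eta_t = np.sqrt(np.log(num_actions) / (num_actions * t))
    a = rng.choice(num_actions, p=policy)
    policy = omd_step(policy, a, true_losses[a], eta_t)

print(policy)
```

Because the learning rate changes with `t`, each step reweights the current policy rather than the cumulative losses, which is what distinguishes this OMD form from a follow-the-regularized-leader update.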
Abstract
This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents' strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an $\epsilon$-approximate CCE in at most $\tilde{O}(H^6 S A / \epsilon^2)$ episodes, where $S$ is the number of states, $A$ is the size of the largest individual action space, and $H$ is the length of an episode. This…
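The exploration side of the algorithm can be sketched as an optimistic, count-based update of a state-value table kept by a single agent. The sketch below is an illustration under stated assumptions, not the paper's exact procedure: the helper `v_update`, the step size $(H+1)/(H+t)$, the bonus form, and the constant `bonus_scale` are hypothetical choices, and rewards are assumed to lie in $[0, 1]$.

```python
import math

def v_update(v_tilde, v, counts, h, s, reward, s_next, horizon, bonus_scale=1.0):
    """Optimistic value update after observing (s, reward, s_next) at step h.

    Steps are 1-indexed (h = 1..horizon). Tables are dicts keyed by (step, state).
    """
    upper = horizon - h + 1                    # most reward still collectable from step h
    counts[(h, s)] = counts.get((h, s), 0) + 1
    t = counts[(h, s)]                         # visits to (h, s) so far
    alpha = (horizon + 1) / (horizon + t)      # assumed step-size schedule
    bonus = bonus_scale * math.sqrt(horizon ** 3 / t)  # placeholder exploration bonus
    # Unvisited next states default to their optimistic upper bound.
    next_v = 0.0 if h == horizon else v.get((h + 1, s_next), horizon - h)
    old = v_tilde.get((h, s), upper)
    v_tilde[(h, s)] = (1 - alpha) * old + alpha * (reward + next_v + bonus)
    # Truncate so the estimate never exceeds the largest achievable return.
    v[(h, s)] = min(upper, v_tilde[(h, s)])
    return v[(h, s)]

# Usage: one agent updating its own tables from its own observations only.
v_tilde, v, counts = {}, {}, {}
estimate = v_update(v_tilde, v, counts, h=1, s=0, reward=0.3, s_next=2, horizon=5)
print(estimate)
```

Note that the update reads only the agent's own state, reward, and next state, which matches the local-information setting described above; the bonus shrinks as the visit count grows, so optimism fades where data accumulates.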
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
