Decentralized model-free reinforcement learning in stochastic games with average-reward objective
Romain Cravic, Nicolas Gast, Bruno Gaujal

TL;DR
This paper introduces DONQ-learning, a model-free algorithm for decentralized two-player zero-sum stochastic games with average reward, achieving low regret with efficient computation and memory use.
Contribution
It presents the first decentralized model-free algorithm with provable low regret guarantees for infinite-horizon stochastic games under average reward.
Findings
Achieves sublinear regret of order T^{3/4} with high probability.
Achieves sublinear expected regret of order T^{2/3}.
Has low computational complexity and memory requirements.
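To make "sublinear" concrete: a regret of order T^{3/4} over T steps means the average regret per step shrinks like T^{-1/4}, and a regret of order T^{2/3} means it shrinks like T^{-1/3}. The short Python sketch below just evaluates these two rates. It is an illustration only: the function name `per_step_regret_rate` is hypothetical, and the constants and any logarithmic factors in the paper's bounds are not given in this summary, so they are omitted here.

```python
def per_step_regret_rate(T, exponent):
    """Average regret per step when total regret is of order T**exponent.

    Constants and log factors are deliberately ignored (assumption:
    only the polynomial rate matters for this illustration).
    """
    return T ** (exponent - 1.0)


if __name__ == "__main__":
    for T in (10**4, 10**6, 10**8):
        high_prob = per_step_regret_rate(T, 3 / 4)  # high-probability bound rate
        expected = per_step_regret_rate(T, 2 / 3)   # expected-regret bound rate
        print(f"T={T:>9}  T^(-1/4)={high_prob:.4f}  T^(-1/3)={expected:.4f}")
```

Both rates tend to zero as T grows, which is what makes the regret sublinear; the expected-regret rate decays faster than the high-probability one.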
Abstract
We propose the first model-free algorithm that achieves low regret for decentralized learning in two-player zero-sum tabular stochastic games with an infinite-horizon average-reward objective. In decentralized learning, the learning agent controls only one player and tries to achieve low regret against an arbitrary opponent. This contrasts with centralized learning, where the agent tries to approximate the Nash equilibrium by controlling both players. In our infinite-horizon undiscounted setting, additional structural assumptions are needed for learning processes to behave well: here we assume that, for every strategy of the opponent, the agent has a way to go from any state to any other. This assumption is the analogue of the "communicating" assumption in the MDP setting. We show that our Decentralized Optimistic Nash Q-Learning (DONQ-learning) algorithm…
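The reachability assumption in the abstract can be illustrated on a tiny tabular game. The Python sketch below is not taken from the paper: it only checks a restricted version of the assumption, namely that against every pure stationary opponent strategy the agent can reach every state from every other state. The transition table `P`, the function names, and the toy game are all hypothetical, and the paper's formal statement (which covers every opponent strategy) is not reproduced in this summary.

```python
from itertools import product


def all_states_connected(adj):
    """Return True if every state can reach every other state in graph adj."""
    n = len(adj)
    for start in range(n):
        seen = {start}
        stack = [start]
        while stack:  # depth-first search from `start`
            s = stack.pop()
            for t in range(n):
                if adj[s][t] and t not in seen:
                    seen.add(t)
                    stack.append(t)
        if len(seen) < n:
            return False
    return True


def communicating_against_pure_opponents(P):
    """P[s][a][b][t]: hypothetical probability of moving from state s to t
    when the agent plays a and the opponent plays b.

    Enumerates every pure stationary opponent strategy (exponential in the
    number of states, so only suitable for toy examples).
    """
    n_states, n_agent, n_opp = len(P), len(P[0]), len(P[0][0])
    for b in product(range(n_opp), repeat=n_states):
        # Edge s -> t exists if some agent action makes the move possible.
        adj = [[any(P[s][a][b[s]][t] > 0 for a in range(n_agent))
                for t in range(n_states)]
               for s in range(n_states)]
        if not all_states_connected(adj):
            return False
    return True


def toy_game(trap):
    """Two states, two actions per player: agent action 0 stays, 1 switches."""
    P = [[[[0.0, 0.0] for _ in range(2)] for _ in range(2)] for _ in range(2)]
    for s in range(2):
        for b in range(2):
            P[s][0][b][s] = 1.0
            P[s][1][b][1 - s] = 1.0
    if trap:
        # Opponent action 1 in state 0 pins the agent in state 0.
        P[0][1][1] = [1.0, 0.0]
    return P
```

With `toy_game(False)` the check passes because the agent can always switch states; with `toy_game(True)` it fails because the opponent has a strategy that traps the agent in state 0, which is exactly the kind of game the assumption rules out.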
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization
Methods: Q-Learning
