Independent Policy Gradient Methods for Competitive Reinforcement Learning
Constantinos Daskalakis, Dylan J. Foster, Noah Golowich

TL;DR
This paper proves that independent policy gradient algorithms in two-player zero-sum reinforcement learning converge to a min-max equilibrium when the players' learning rates follow a two-timescale rule, providing the first finite-sample guarantees for such decentralized methods.
Contribution
It establishes the first finite-sample convergence guarantees for independent policy gradient methods in competitive reinforcement learning settings.
Findings
Policies converge to a min-max equilibrium when both players run policy gradient with two-timescale learning rates
First finite-sample convergence result for independent policy gradient in competitive RL
Independent algorithms can achieve equilibrium without centralized coordination
Abstract
We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent policy gradient methods in competitive RL; prior work has largely focused on centralized, coordinated procedures for equilibrium computation.
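The update pattern described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a single-state zero-sum game (a payoff matrix), uses exact gradients instead of estimates from sampled episodes, and reads "two-timescale" as one player using a much smaller step size than the other. The names `project_to_simplex`, `independent_gradient_play`, `eta_slow`, and `eta_fast` are hypothetical, and the step sizes are arbitrary illustrative values.

```python
def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k passing this check gives the threshold
    return [max(x - theta, 0.0) for x in v]


def independent_gradient_play(payoff, x0, y0, eta_slow, eta_fast, steps):
    """Simultaneous projected gradient play on a zero-sum matrix game.

    The row player (policy x) minimizes x^T A y with the small step size
    eta_slow; the column player (policy y) maximizes it with the larger
    step size eta_fast. Each player uses only the gradient with respect
    to its own policy. This is an assumption-laden simplification.
    """
    n, m = len(payoff), len(payoff[0])
    x, y = list(x0), list(y0)
    x_sum = [0.0] * n
    for _ in range(steps):
        # Both gradients are computed from the current policies before
        # either player moves, so the updates are simultaneous.
        grad_x = [sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n)]
        grad_y = [sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m)]
        x = project_to_simplex([x[i] - eta_slow * grad_x[i] for i in range(n)])
        y = project_to_simplex([y[j] + eta_fast * grad_y[j] for j in range(m)])
        x_sum = [x_sum[i] + x[i] for i in range(n)]
    x_avg = [s / steps for s in x_sum]
    return x, y, x_avg


# Matching pennies: the unique min-max equilibrium is uniform play by both players.
MATCHING_PENNIES = [[1.0, -1.0], [-1.0, 1.0]]

x_last, y_last, x_avg = independent_gradient_play(
    MATCHING_PENNIES,
    x0=[0.9, 0.1],
    y0=[0.2, 0.8],
    eta_slow=0.001,  # slow player (illustrative value)
    eta_fast=0.1,    # fast player (illustrative value)
    steps=5000,
)
```

Inspecting `x_avg` against the uniform policy gives a rough sense of how the slow player's averaged play behaves in this toy game. The sketch makes no claim about matching the paper's rates or its sampled-gradient setting.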
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Advanced Thermodynamics and Statistical Mechanics
