Anytime PSRO for Two-Player Zero-Sum Games
Stephen McAleer, Kevin Wang, John Lanier, Marc Lanctot, Pierre Baldi, Tuomas Sandholm, Roy Fox

TL;DR
This paper introduces Anytime PSRO, an algorithm for two-player zero-sum games that guarantees convergence to Nash equilibrium while monotonically reducing exploitability, improving over existing methods like PSRO and DO.
Contribution
The paper proposes two algorithms for two-player zero-sum games, anytime double oracle (ADO) and Anytime PSRO (APSRO), which ensure convergence to a Nash equilibrium while decreasing exploitability monotonically.
Findings
Achieves lower exploitability than DO and PSRO.
Monotonically decreases exploitability over iterations.
Effective in Leduc poker and random normal-form games.
Abstract
Policy space response oracles (PSRO) is a multi-agent reinforcement learning algorithm that has achieved state-of-the-art performance in very large two-player zero-sum games. PSRO is based on the tabular double oracle (DO) method, an algorithm that is guaranteed to converge to a Nash equilibrium, but may increase exploitability from one iteration to the next. We propose anytime double oracle (ADO), a tabular double oracle algorithm for 2-player zero-sum games that is guaranteed to converge to a Nash equilibrium while decreasing exploitability from one iteration to the next. Unlike DO, in which the restricted distribution is based on the restricted game formed by each player's strategy sets, ADO finds the restricted distribution for each player that minimizes its exploitability against any policy in the full, unrestricted game. We also propose a method of finding this restricted…
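The contrast the abstract draws between DO and ADO can be illustrated on a toy matrix game. The sketch below is not the paper's implementation: it uses rock-paper-scissors, a restricted set of two strategies, and a plain grid search, all of which are assumptions chosen for illustration. The helper names (`worst_case_value`, `exploitability`, `best_two_row_mix`) are hypothetical, and exploitability is taken here as the sum of both players' best-response gains (some authors divide by two).

```python
from typing import Sequence, Tuple

# Row player's payoffs in rock-paper-scissors (rows/columns: Rock, Paper, Scissors).
# Toy game chosen for illustration; it does not come from the paper.
RPS = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]


def worst_case_value(payoff, rows: Sequence[int], weights: Sequence[float]) -> float:
    """Row player's payoff when mixing `weights` over `rows` while the
    column player best-responds over ALL columns of the full game."""
    n_cols = len(payoff[0])
    return min(
        sum(w * payoff[r][c] for r, w in zip(rows, weights))
        for c in range(n_cols)
    )


def exploitability(payoff, row_mix: Sequence[float], col_mix: Sequence[float]) -> float:
    """Sum of both players' best-response gains; zero at a Nash equilibrium."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_row = max(
        sum(payoff[r][c] * col_mix[c] for c in range(n_cols)) for r in range(n_rows)
    )
    best_col = worst_case_value(payoff, range(n_rows), row_mix)
    return best_row - best_col


def best_two_row_mix(payoff, rows: Tuple[int, int], steps: int = 300) -> Tuple[float, float]:
    """Grid search (an illustrative stand-in, only for two restricted rows) for
    the mix over `rows` that maximizes the worst-case value in the full game.
    Returns (weight on rows[0], worst-case value)."""
    best_p, best_v = 0.0, float("-inf")
    for i in range(steps + 1):
        p = i / steps
        v = worst_case_value(payoff, rows, (p, 1.0 - p))
        if v > best_v:
            best_p, best_v = p, v
    return best_p, best_v


# Restricted set for the row player: {Rock, Paper}.
# DO-style: in the 2x2 subgame where both players hold {Rock, Paper},
# Paper dominates Rock, so the restricted Nash strategy is pure Paper.
do_value = worst_case_value(RPS, (0, 1), (0.0, 1.0))
# ADO-style: pick the mix over {Rock, Paper} that is least exploitable
# against every column of the full game.
ado_p, ado_value = best_two_row_mix(RPS, (0, 1))

print(do_value)          # -1.0: pure Paper loses to Scissors
print(ado_p, ado_value)  # about 0.333 and about -0.333
```

In this toy case the restricted-game solution is fully exploited by a strategy outside the restricted set, whereas the mix chosen against the full game limits the loss to about one third. Uniform play over all three actions has zero exploitability under `exploitability`, as expected at equilibrium.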
Taxonomy
Topics: Artificial Intelligence in Games · Gambling Behavior and Treatments · Reinforcement Learning in Robotics
