A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence
Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

TL;DR
This paper demonstrates that a policy gradient method can be used in two-player zero-sum imperfect-information games to provably converge to a Nash equilibrium, bridging a gap between reinforcement learning and game theory.
Contribution
It introduces a policy gradient approach with provable best-iterate convergence in two-player zero-sum imperfect-information extensive-form games, which the authors present as the first result of its kind.
Findings
Proves policy gradient convergence to Nash equilibrium in self-play
Establishes best-iterate convergence guarantees
Bridges reinforcement learning and game theory in imperfect-information settings
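To make "best-iterate convergence in self-play" concrete, here is a minimal sketch on a toy problem. It is not the paper's algorithm: it uses rock-paper-scissors as a normal-form game (not an extensive-form game), exact payoff gradients rather than Q-values estimated from trajectories, and an entropy-regularized multiplicative-weights (mirror descent) update chosen purely for illustration. The names `regularized_step` and `self_play` and the parameters `eta` (step size) and `tau` (regularization strength) are assumptions introduced here.

```python
import math

# Rock-paper-scissors payoff to the row player (row maximizes, column minimizes).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def row_values(A, y):
    # Expected payoff of each pure row action against column strategy y.
    return [sum(a * yj for a, yj in zip(row, y)) for row in A]

def col_values(A, x):
    # Expected row-player payoff for each pure column action against row strategy x.
    return [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]

def exploitability(A, x, y):
    # Sum of both players' best-response gains; zero exactly at a Nash equilibrium.
    return max(row_values(A, y)) - min(col_values(A, x))

def regularized_step(p, grad, eta, tau):
    # Illustrative update: p_new is proportional to
    # (p * exp(eta * grad)) ** (1 / (1 + eta * tau)),
    # i.e. a mirror-descent step with a KL pull toward the uniform strategy.
    logits = [(math.log(pi) + eta * g) / (1.0 + eta * tau) for pi, g in zip(p, grad)]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [wi / s for wi in w]

def self_play(A, x, y, eta=0.05, tau=0.1, steps=5000):
    # Both players update simultaneously; track the lowest exploitability seen.
    best = exploitability(A, x, y)
    for _ in range(steps):
        gx = row_values(A, y)                 # row player ascends its payoff
        gy = [-v for v in col_values(A, x)]   # column player ascends the negated payoff
        x, y = regularized_step(x, gx, eta, tau), regularized_step(y, gy, eta, tau)
        best = min(best, exploitability(A, x, y))
    return x, y, best

x, y, best = self_play(A, [0.6, 0.3, 0.1], [0.2, 0.5, 0.3])
print(x, y, best)
```

In this symmetric toy game the regularized equilibrium coincides with the uniform Nash equilibrium, so regularization introduces no bias here; in general it does, which is the concern one of the reviews below raises. The `best` value is the quantity a best-iterate guarantee refers to, as opposed to the exploitability of the averaged strategies.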
Abstract
Policy gradient methods have become a staple of any single-agent reinforcement learning toolbox, due to their combination of desirable properties: iterate convergence, efficient use of stochastic trajectory feedback, and theoretically-sound avoidance of importance sampling corrections. In multi-agent imperfect-information settings (extensive-form games), however, it is still unknown whether the same desiderata can be guaranteed while retaining theoretical guarantees. Instead, sound methods for extensive-form games rely on approximating counterfactual values (as opposed to Q values), which are incompatible with policy gradient methodologies. In this paper, we investigate whether policy gradient can be safely used in two-player zero-sum imperfect-information extensive-form games (EFGs). We establish positive results, showing for the first time that a policy gradient method leads to…
Peer Reviews
Decision: ICLR 2025 Poster
The analysis is sound, with sublinear average-iterate convergence. The proposed idea is novel and easy to follow.
Analysis section:
- Although it makes sense to enforce strong convexity on the bilinear objective via regularization for easier analysis and a stronger convergence guarantee, it also introduces a bias in the equilibrium. As the authors are introducing a new regularization as part of their novelty, they are also expected to show how large this bias is.
Experiment section:
- As the authors mentioned in the intro, the motivation behind proposing and proving the convergence of policy gradien
The idea of using Q-values in regret minimization is reasonable. The writing is clear and the paper is easy to follow.
The proposed algorithm combines optimistic mirror descent updates and estimating Q-values from rollouts, both of which seem to be well-known techniques. The technical novelty might be limited. The motivations seem to be disconnected from later sections, such as the experiments.
- The contribution of the paper is strong: the proposed approach only requires sampling randomly generated trajectories (as opposed to using importance sampling) to estimate the value function and compute the policy. The approach is proved to have best-iterate convergence guarantees to the Nash equilibria of the regularized game under both full and imperfect information.
- Clear introduction of related works and main obstacles in the field as well as contribution statement. I really enjoyed re
1. Line 280: The introduction of the algorithm can be more detailed. Right now, it is only a few lines and assumes the readers are already familiar with the references. For example, the author(s) can do a better job walking the readers through their update of the regularizer at Line 8.
2. The exploitability metric used in the experiment section is undefined. The experiment section can be better presented: what makes the proposed approach outperform a certain baseline in a certain setting? Why i
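For readers who share the concern that exploitability is left undefined, a standard definition for two-player zero-sum games is sketched below. The symbols are assumptions introduced here: $u(x, y)$ denotes player 1's expected payoff under the strategy profile $(x, y)$, and the paper's own definition may differ by a constant factor (some works report half of this quantity).

```latex
\[
  \mathrm{expl}(x, y) \;=\; \max_{x'} u(x', y) \;-\; \min_{y'} u(x, y')
\]
```

This quantity is nonnegative and equals zero exactly when $(x, y)$ is a Nash equilibrium, since it sums the amount each player could gain by switching to a best response.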
Taxonomy
Topics: Auction Theory and Applications · Advanced Bandit Algorithms Research · Economic theories and models
