A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence

Mingyang Liu; Gabriele Farina; Asuman Ozdaglar

arXiv:2408.00751·cs.GT·July 10, 2025

Open Access · 3 Reviews

TL;DR

This paper demonstrates that a policy gradient method can be used in two-player zero-sum imperfect-information games to provably converge to a Nash equilibrium, bridging a gap between reinforcement learning and game theory.

Contribution

It introduces a policy gradient approach with theoretical guarantees of convergence in imperfect-information extensive-form games, a novel result in the field.

Findings

01. Proves policy gradient convergence to Nash equilibrium in self-play

02. Establishes best-iterate convergence guarantees

03. Bridges reinforcement learning and game theory in imperfect-information settings

Abstract

Policy gradient methods have become a staple of any single-agent reinforcement learning toolbox, due to their combination of desirable properties: iterate convergence, efficient use of stochastic trajectory feedback, and theoretically-sound avoidance of importance sampling corrections. In multi-agent imperfect-information settings (extensive-form games), however, it is still unknown whether the same desiderata can be guaranteed while retaining theoretical guarantees. Instead, sound methods for extensive-form games rely on approximating counterfactual values (as opposed to Q values), which are incompatible with policy gradient methodologies. In this paper, we investigate whether policy gradient can be safely used in two-player zero-sum imperfect-information extensive-form games (EFGs). We establish positive results, showing for the first time that a policy gradient method leads to…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The analysis is sound, with sublinear average-iterate convergence. The proposed idea is novel and easy to follow.

Weaknesses

Analysis section:
- Although it makes sense to enforce strong convexity on the bilinear objective via regularization, for easier analysis and a stronger convergence guarantee, this also biases the equilibrium. Since the authors introduce a new regularization as part of their novelty, they should also show how large this bias is.

Experiment section:
- As the authors mention in the introduction, the motivation behind proposing and proving the convergence of policy gradien

Reviewer 02 · Rating 5 · Confidence 2

Strengths

The idea of using Q-values in regret minimization is reasonable. The writing is clear and the paper is easy to follow.

Weaknesses

The proposed algorithm combines optimistic mirror descent updates with Q-value estimation from rollouts, both of which seem to be well-known techniques, so the technical novelty might be limited. The motivation seems disconnected from later sections, such as the experiments.
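For readers unfamiliar with the optimistic mirror descent ingredient mentioned above, the following is a minimal sketch on a small matrix game. It is not the paper's algorithm: it assumes a normal-form (matrix) game instead of an extensive-form game, exact payoffs instead of Q-values estimated from rollouts, and an entropy regularizer with weight `tau` chosen purely for illustration. The payoff matrix, the function names, and the step size `eta` are all hypothetical.

```python
import numpy as np


def log_softmax(z):
    """Return log-probabilities proportional to exp(z), computed stably."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())


def exploitability(A, x, y):
    """Duality gap of (x, y) in the unregularized game.

    The row player maximizes x^T A y and the column player minimizes it.
    Some papers halve this quantity; that convention is not assumed here.
    """
    return float((A @ y).max() - (A.T @ x).min())


def regularized_omwu(A, tau=0.05, eta=0.05, iters=20000):
    """Optimistic multiplicative weights on an entropy-regularized matrix game.

    Illustrative assumption: each step is a KL mirror step on the payoff
    plus tau times the entropy, with a prediction pass before the update.
    """
    n, m = A.shape
    log_x = np.full(n, -np.log(n))  # start from uniform strategies
    log_y = np.full(m, -np.log(m))
    x_bar, y_bar = np.exp(log_x), np.exp(log_y)
    shrink = 1.0 - eta * tau  # pull toward uniform caused by the entropy term
    for _ in range(iters):
        # Prediction step: respond to the opponent's previous prediction.
        x_bar_next = np.exp(log_softmax(shrink * log_x + eta * (A @ y_bar)))
        y_bar_next = np.exp(log_softmax(shrink * log_y - eta * (A.T @ x_bar)))
        # Update step: respond to the opponent's fresh prediction.
        log_x = log_softmax(shrink * log_x + eta * (A @ y_bar_next))
        log_y = log_softmax(shrink * log_y - eta * (A.T @ x_bar_next))
        x_bar, y_bar = x_bar_next, y_bar_next
    return np.exp(log_x), np.exp(log_y)


if __name__ == "__main__":
    # Hypothetical weighted rock-paper-scissors payoffs for the row player.
    A = np.array([[0.0, -1.0, 2.0],
                  [1.0, 0.0, -1.0],
                  [-2.0, 1.0, 0.0]])
    x, y = regularized_omwu(A)
    print("row strategy:", x)
    print("column strategy:", y)
    print("exploitability in the original game:", exploitability(A, x, y))
```

In this sketch the iterates settle at the equilibrium of the regularized game, so their exploitability in the original game need not vanish. For entropy regularization it is bounded by `tau * (log n + log m)`, which is one simple way to quantify the bias that Reviewer 01 asks about; a smaller `tau` shrinks the bias at the cost of slower convergence.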

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The contribution of the paper is strong: the proposed approach only requires sampling randomly generated trajectories (as opposed to using importance sampling) to estimate the value function and compute the policy. The approach is proved to have best-iterate convergence guarantees to the Nash equilibria of the regularized game under both full and imperfect information.
- Clear introduction of related works and main obstacles in the field as well as contribution statement. I really enjoyed re

Weaknesses

1. Line 280: The introduction of the algorithm could be more detailed. Right now, it is only a few lines and assumes that readers are already familiar with the references. For example, the author(s) can do a better job of walking readers through their update of the regularizer at Line 8.
2. The exploitability metric used in the experiment section is undefined. The experiment section could be better presented: what makes the proposed approach outperform a certain baseline in a certain setting? Why i

Taxonomy

Topics: Auction Theory and Applications · Advanced Bandit Algorithms Research · Economic theories and models