Reevaluating Policy Gradient Methods for Imperfect-Information Games

Max Rudolph; Nathan Lichtle; Sobhan Mohammadpour; Alexandre Bayen; J. Zico Kolter; Amy Zhang; Gabriele Farina; Eugene Vinitsky; Samuel Sokota

arXiv:2502.08938·cs.LG·March 18, 2026



TL;DR

This paper compares various deep reinforcement learning algorithms in imperfect-information games and finds that simpler policy gradient methods like PPO outperform more complex approaches based on fictitious play, double oracle, and CFR.

Contribution

It provides the first large-scale exploitability comparison of DRL algorithms in imperfect-information games, demonstrating the competitiveness of generic policy gradient methods.

Findings

01 · PPO and similar generic policy gradient methods outperform FP-, DO-, and CFR-based approaches.

02 · The comparison spans over 7000 training runs.

03 · The authors release the first broadly accessible exact exploitability computations for five large games.

Abstract

In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for five large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 7000 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform…
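To make the evaluation metric concrete, here is a minimal sketch of exploitability for a two-player zero-sum matrix game. It is an illustration only, not the authors' released code: the paper computes exact exploitability in large imperfect-information games, whereas this example assumes a small normal-form game. It uses one common convention in which exploitability is the average of the two players' best-response gains. The function name `exploitability` and the rock-paper-scissors payoff matrix are assumptions introduced for this example.

```python
from typing import List

Matrix = List[List[float]]


def exploitability(payoff: Matrix, row_policy: List[float], col_policy: List[float]) -> float:
    """Average best-response gain in a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives its negation.
    Hypothetical helper for illustration, assuming a normal-form game.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Row player's expected payoff for each pure action against col_policy.
    row_values = [
        sum(payoff[i][j] * col_policy[j] for j in range(n_cols)) for i in range(n_rows)
    ]
    # Row player's expected payoff under row_policy when the column player plays pure action j.
    col_values = [
        sum(row_policy[i] * payoff[i][j] for i in range(n_rows)) for j in range(n_cols)
    ]
    best_row = max(row_values)    # value of the row player's best response
    best_col = -min(col_values)   # value of the column player's best response
    # The profile's own values cancel in a zero-sum game, so the sum of
    # best-response values equals the sum of best-response gains.
    return (best_row + best_col) / 2.0


# Rock-paper-scissors payoffs for the row player (illustrative assumption).
RPS = [[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]
uniform = [1.0 / 3.0] * 3
always_rock = [1.0, 0.0, 0.0]

nash_gap = exploitability(RPS, uniform, uniform)
rock_gap = exploitability(RPS, always_rock, uniform)
```

Under this convention the uniform profile has exploitability 0 because it is a Nash equilibrium, while a row player who always plays rock can be exploited by a column player who always plays paper, giving an exploitability of 0.5.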

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

This is important work! It's very important to be sure the community is not trapped by oft-repeated claims that are just known, but not truly validated. The goals of this paper are important. But, what is most impressive about the paper is the carefulness in the transparency of the comparison, recognition of their own biases, aim for reproducibility, and the limitations of what the empirical results justify. This is commendable. The following line is one of my favourites a paper focused on

Weaknesses

I think the biggest weakness of the paper is that it comes at this work with a very strong bias and preconceived expectation of the result. I am glad the authors recognize this bias, but I think that bias gives an edge to the writing that is not needed in the text. I believe the authors gave all algorithms a fair shake. More importantly, they are transparent about the shake they gave them. But you don't get that picture through the first half of the paper. I found myself immediately bristl

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. This paper makes significant efforts in aligning the existing game RL algorithms and fairly evaluating them under the strict exploitability measure.
2. The elaborations on the experimental settings and choice of benchmarks make the conclusion convincing.

Weaknesses

1. My major concern about this paper is that the proposed hypothesis is not novel. To me, the hypothesis is somewhat equivalent to saying last-iterate convergence is superior to average-iterate convergence under current game RL implementations. I think it has been common sense since recent game-theoretic research has a significant focus on last-iterate convergence (e.g., [1,2]), which usually implies a linear convergence rate. In some small games, the last iterates can converge exponentially faste

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* The paper presents interesting and important empirical evidence that may encourage people to try a simpler and more standard solution to solve their problems.
* The paper makes a rigorous effort to compare to several baselines.
* The paper promises to publish a more efficient BR computation for OpenSpiel, which might be useful

Weaknesses

Summary: I believe the empirical results are interesting and likely worth publishing. However, the writing of the paper is not ideal. The authors try to condense the main message of the paper to a short paragraph they call a "hypothesis". However, for me, a hypothesis is a formal statement that can be proven to be true or false. The "hypothesis" in this paper is a very vague observation that is subject to many interpretations and many of them are clearly not true in general. Moreover, there ar


Taxonomy

Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Simulation Techniques and Applications

Methods: Entropy Regularization · Proximal Policy Optimization