Reevaluating Policy Gradient Methods for Imperfect-Information Games
Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, Samuel Sokota

TL;DR
This paper compares deep reinforcement learning algorithms in imperfect-information games and finds that simpler generic policy gradient methods like PPO match or outperform more complex approaches based on fictitious play, double oracle, and CFR.
Contribution
It provides the first large-scale exploitability comparison of DRL algorithms in imperfect-information games, demonstrating the competitiveness of generic policy gradient methods.
Findings
Generic policy gradient methods such as PPO are competitive with or superior to FP-, DO-, and CFR-based approaches.
The comparison spans over 7000 training runs.
The authors implement and release the first broadly accessible exact exploitability computations for five large games.
Abstract
In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for five large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 7000 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform…
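Exploitability, the metric the abstract compares algorithms on, measures how much a best-responding opponent can gain against a policy. The sketch below illustrates the idea on a two-player zero-sum matrix game, which is a deliberate simplification: the paper's games are large imperfect-information games, and this is not the authors' released implementation. The function name `exploitability` and the rock-paper-scissors payoff table are assumptions made for illustration.

```python
def exploitability(payoff, x, y):
    """Average best-response gain in a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives
    -payoff[i][j]. x and y are the row and column players' mixed strategies.
    (Hypothetical helper for illustration only.)
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Value of each pure row action against the column strategy y.
    row_values = [sum(payoff[i][j] * y[j] for j in range(n_cols))
                  for i in range(n_rows)]
    # Value of each pure column action against the row strategy x.
    col_values = [-sum(x[i] * payoff[i][j] for i in range(n_rows))
                  for j in range(n_cols)]
    # In a zero-sum game the two on-policy values cancel, so the average
    # gain from deviating is half the sum of the best-response values.
    return 0.5 * (max(row_values) + max(col_values))


# Rock-paper-scissors payoffs for the row player (illustrative example).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

expl_uniform = exploitability(RPS, uniform, uniform)       # equilibrium profile
expl_rock = exploitability(RPS, always_rock, always_rock)  # exploitable profile
```

A profile with zero exploitability is a Nash equilibrium, so lower is better. In extensive-form games the same quantity requires a best-response computation over the full game tree, which is why exact values are costly to compute for large games.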
Peer Reviews
Decision·ICLR 2026 Poster
This is important work! It's very important to be sure the community is not trapped by oft-repeated claims that are just known, but not truly validated. The goals of this paper are important. But, what is most impressive about the paper is the carefulness in the transparency of the comparison, recognition of their own biases, aim for reproducibility, and the limitations of what the empirical results justify. This is commendable. The following line is one of my favourites a paper focused on
I think the biggest weakness of the paper is the paper comes at this work with a very strong bias and preconceived expectation to the result. I am glad the authors recognize this bias, but I think that bias gives an edge to the writing that is not needed in the text. I believe the authors gave all algorithms a fair shake. More importantly they are transparent about the shake they gave them. But you don't get that picture through the first half of the paper. I found myself immediately bristl
1. This paper makes significant efforts in aligning the existing game RL algorithms and fairly evaluating them under the strict exploitability measure.
2. The elaborations on the experimental settings and choice of benchmarks make the conclusion convincing.
1. My major concern about this paper is that the proposed hypothesis is not novel. To me, the hypothesis is somewhat equivalent to saying last-iterate convergence is superior to average-iterate convergence under current game RL implementations. I think it has been common sense since recent game-theoretic research has a significant focus on last-iterate convergence (e.g., [1,2]), which usually implies a linear convergence rate. In some small games, the last iterates can converge exponentially faste
* The paper presents interesting and important empirical evidence that may encourage people to try a simpler and more standard solution to solve their problems.
* The paper makes a rigorous effort to compare to several baselines.
* The paper promises to publish a more efficient BR computation for OpenSpiel, which might be useful
Summary: I believe the empirical results are interesting and likely worth publishing. However, the writing of the paper is not ideal. The authors try to condense the main message of the paper to a short paragraph they call a "hypothesis". However, for me, a hypothesis is a formal statement that can be proven to be true or false. The "hypothesis" in this paper is a very vague observation that is subject to many interpretations and many of them are clearly not true in general. Moreover, there ar
Taxonomy
Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Simulation Techniques and Applications
Methods: Entropy Regularization · Proximal Policy Optimization
