GTO Wizard Benchmark
Marc-Antoine Provost, Nejc Ilenic, Christopher Solinas, and Philippe Beardsell

TL;DR
The GTO Wizard Benchmark provides a standardized framework for evaluating poker algorithms against a superhuman AI, addressing variance issues and benchmarking large language models' reasoning in multi-agent, partially observable environments.
Contribution
It introduces a public API and evaluation framework for HUNL poker, incorporating variance reduction techniques and benchmarking LLMs' reasoning capabilities.
Findings
GTO Wizard AI defeated Slumbot, the previous strongest publicly accessible HUNL benchmark, by $19.4$ bb/100.
Variance reduction with AIVAT reaches the same statistical significance with ten times fewer hands than naive Monte Carlo evaluation.
LLMs show progress but still lag behind specialized poker agents.
Abstract
We introduce GTO Wizard Benchmark, a public API and standardized evaluation framework for benchmarking algorithms in Heads-Up No-Limit Texas Hold'em (HUNL). The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent that approximates Nash equilibria. GTO Wizard AI defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previous strongest publicly accessible HUNL benchmark, by $19.4$ bb/100. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.…
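To make the "ten times fewer hands" claim concrete, the sketch below works through the standard sample-size arithmetic for a win rate in bb/100. It is not the benchmark's API or an implementation of AIVAT. The function names and the per-hand standard deviation of 20 bb are hypothetical, and the code assumes independent hands and a normal approximation. Under those assumptions the required number of hands scales with the variance of the per-hand estimate, so a tenfold cut in variance means roughly a tenfold cut in hands.

```python
import math
from statistics import NormalDist


def bb_per_100(total_bb_won: float, hands: int) -> float:
    """Win rate in big blinds per 100 hands."""
    return 100.0 * total_bb_won / hands


def hands_needed(std_bb_per_hand: float,
                 margin_bb_per_100: float,
                 confidence: float = 0.95) -> int:
    """Hands required so the confidence-interval half-width is at most
    margin_bb_per_100, assuming i.i.d. hands and a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Half-width in bb/100 is 100 * z * std / sqrt(n); solve for n.
    n = (100.0 * z * std_bb_per_hand / margin_bb_per_100) ** 2
    return math.ceil(n)


# Hypothetical per-hand standard deviation for naive Monte Carlo evaluation.
NAIVE_STD = 20.0
# A tenfold variance reduction divides the standard deviation by sqrt(10).
REDUCED_STD = NAIVE_STD / math.sqrt(10.0)

naive_hands = hands_needed(NAIVE_STD, margin_bb_per_100=5.0)
reduced_hands = hands_needed(REDUCED_STD, margin_bb_per_100=5.0)

if __name__ == "__main__":
    print(f"naive evaluation:   {naive_hands} hands")
    print(f"reduced variance:   {reduced_hands} hands")
    print(f"ratio:              {naive_hands / reduced_hands:.2f}")
```

The ratio of the two hand counts depends only on the variance ratio, not on the chosen margin or confidence level, which is why a variance reduction factor translates directly into a sample-efficiency factor.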
Taxonomy
Topics: Artificial Intelligence in Games · Sports Analytics and Performance · Reinforcement Learning in Robotics
