GTO Wizard Benchmark

Marc-Antoine Provost, Nejc Ilenic, Christopher Solinas, and Philippe Beardsell

arXiv:2603.23660 · cs.AI · March 26, 2026

Open Access

TL;DR

The GTO Wizard Benchmark is a standardized framework for evaluating poker algorithms against a superhuman AI. It reduces the variance inherent in poker evaluation and benchmarks large language models' reasoning in a multi-agent, partially observable environment.

Contribution

The paper introduces a public API and standardized evaluation framework for Heads-Up No-Limit Texas Hold'em (HUNL), integrates the AIVAT variance reduction technique, and benchmarks state-of-the-art LLMs under zero-shot conditions.

Findings

01

GTO Wizard AI defeated Slumbot, the previous strongest publicly accessible HUNL benchmark, by $19.4 \pm 4.1$ bb/100.

02

AIVAT variance reduction reaches equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation.

03

LLMs show progress but still lag behind specialized poker agents.

Abstract

We introduce GTO Wizard Benchmark, a public API and standardized evaluation framework for benchmarking algorithms in Heads-Up No-Limit Texas Hold'em (HUNL). The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent that approximates Nash equilibria and defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previous strongest publicly accessible HUNL benchmark, by $19.4 \pm 4.1$ bb/100. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.…
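To make the variance argument concrete, the sketch below shows how a win rate in bb/100 and its confidence interval can be computed from per-hand results, and why a tenfold reduction in hands corresponds to roughly a tenfold reduction in per-hand variance. It is not AIVAT and does not use the benchmark's API. The function names, the normal-approximation 95% interval (z = 1.96), and the per-hand standard deviation are all assumptions chosen for illustration; the abstract does not state which confidence level its $\pm 4.1$ refers to.

```python
import math
import statistics


def winrate_bb100(results_bb, z=1.96):
    """Return (win rate, CI half-width), both in bb/100.

    results_bb holds per-hand winnings in big blinds. The interval uses a
    normal approximation; z=1.96 (about 95%) is an assumption, not a value
    taken from the paper.
    """
    n = len(results_bb)
    mean = statistics.fmean(results_bb)
    sem = statistics.stdev(results_bb) / math.sqrt(n)  # standard error per hand
    return 100.0 * mean, 100.0 * z * sem


def hands_needed(std_bb_per_hand, half_width_bb100, z=1.96):
    """Hands required for the CI half-width to reach half_width_bb100."""
    return math.ceil((100.0 * z * std_bb_per_hand / half_width_bb100) ** 2)


# Hypothetical per-hand standard deviation in big blinds (illustrative only).
naive_std = 10.0
# A tenfold variance reduction divides the standard deviation by sqrt(10).
reduced_std = naive_std / math.sqrt(10)

target = 4.1  # desired half-width in bb/100, matching the interval quoted above
naive_n = hands_needed(naive_std, target)
reduced_n = hands_needed(reduced_std, target)

print(f"hands needed without variance reduction: {naive_n}")
print(f"hands needed with 10x lower variance:    {reduced_n}")
print(f"ratio: {naive_n / reduced_n:.2f}")
```

Because the required sample size grows with the square of the standard deviation, the ratio between the two hand counts stays close to ten whatever value is assumed for `naive_std`.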

Taxonomy

Topics: Artificial Intelligence in Games · Sports Analytics and Performance · Reinforcement Learning in Robotics