The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Seth Karten; Jake Grigsby; Tersoo Upaa Jr; Junik Bae; Seonghun Hong; Hyunyoung Jeong; Jaeyoon Jung; Kun Kerdthaisong; Gyungbo Kim; Hyeokgi Kim; Yujin Kim; Eunju Kwon; Dongyu Liu; Patrick Mariglia; Sangyeon Park; Benedikt Schink; Xianwei Shi; Anthony Sistilli; Joseph Twin; Arian Urdu; Matin Urdu; Qiao Wang; Ling Wu; Wenli Zhang; Kunsheng Zhou; Stephanie Milani; Kiran Vodrahalli; Amy Zhang; Fei Fang; Yuke Zhu; Chi Jin

arXiv:2603.15563 · cs.LG · March 18, 2026

TL;DR

The PokeAgent Challenge introduces a large-scale Pokemon-based benchmark for decision-making research. It stresses partial observability, game-theoretic reasoning, and long-horizon planning through two tracks: a Battling Track built on competitive Pokemon battles and a Speedrunning Track set in the Pokemon RPG.

Contribution

The paper presents the first comprehensive Pokemon-based benchmark, with extensive datasets, standardized evaluation frameworks, and a NeurIPS 2025 competition, to advance AI research in complex, realistic environments.

Findings

01

Over 100 teams participated, showing significant gaps between AI and human performance.

02

Pokemon battling capabilities are orthogonal to existing LLM benchmarks.

03

The benchmark reveals new AI challenges in partial observability and long-horizon planning.

Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track…

Taxonomy

Topics: Artificial Intelligence in Games · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)