PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

Ruizhi Zhang; Ye Huang; Yuangang Pan; Chuanfu Shen; Zhilin Liu; Ting Xie; Wen Li; Lixin Duan

arXiv:2604.08340·cs.CV·April 10, 2026

TL;DR

PokeGym is a visually-driven, long-horizon benchmark set in a 3D open-world game, Pokemon Legends: Z-A. It evaluates how well vision-language models (VLMs) handle complex tasks that require autonomous exploration and visual reasoning from raw RGB observations alone.

Contribution

PokeGym contributes a benchmark that enforces strict visual-only operation, covers diverse tasks, and evaluates success automatically. Its results expose key limitations of current VLMs and point to where architectural improvements are needed.

Findings

01. Current VLMs struggle with physical deadlock recovery.

02. Weaker models often fail to recognize entrapment (Unaware Deadlocks).

03. Advanced models recognize deadlocks but cannot recover (Aware Deadlocks).
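
The distinction between Unaware and Aware Deadlocks can be illustrated with a small sketch. The summary above does not give the paper's detection criteria, so everything here is an assumption made for illustration: a deadlock is taken to mean that the agent's position did not change over the last `window` steps, and awareness is taken to mean that the agent flagged itself as stuck at least once in that window. The function name `classify_deadlock` and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple

Position = Tuple[int, int]


def classify_deadlock(
    positions: List[Position],
    reported_stuck: List[bool],
    window: int = 5,
) -> Optional[str]:
    """Label the end of a trajectory as 'aware', 'unaware', or None.

    Assumed definitions (not taken from the paper):
      * deadlock: the position is identical across the last `window` steps
      * aware:    the agent reported being stuck at least once in that window
    """
    if len(positions) != len(reported_stuck):
        raise ValueError("positions and reported_stuck must have equal length")
    if len(positions) < window:
        return None  # too short to judge

    tail = positions[-window:]
    if len(set(tail)) > 1:
        return None  # the agent is still moving, so no deadlock

    return "aware" if any(reported_stuck[-window:]) else "unaware"


# Stuck against an obstacle without ever noticing it.
stuck = [(0, 0), (1, 0)] + [(2, 0)] * 5
print(classify_deadlock(stuck, [False] * 7))
# Same trajectory, but the agent says it is stuck on the last step.
print(classify_deadlock(stuck, [False] * 6 + [True]))
```

Under these assumptions, the second and third findings correspond to the `unaware` and `aware` labels respectively: in both cases the trajectory ends in a deadlock, and only the agent's self-report differs.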

Abstract

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment.…
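
The isolation described in the abstract, where the agent sees only pixels while a separate evaluator inspects internal state, can be sketched with a toy example. This is a minimal illustration under stated assumptions, not PokeGym's actual code: `ToyGame`, `pixel_agent`, `memory_evaluator`, and `run_episode` are hypothetical names, the "game" is a one-row corridor, and reading a private attribute stands in for memory scanning.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[Pixel]  # a single-row RGB image, enough for a toy corridor

RED, GREEN, BLACK = (255, 0, 0), (0, 255, 0), (0, 0, 0)


@dataclass
class HiddenState:
    """Privileged state; a hypothetical stand-in for game memory."""
    player_x: int = 0
    goal_x: int = 4
    width: int = 6


class ToyGame:
    def __init__(self) -> None:
        self._state = HiddenState()

    def render(self) -> Frame:
        """The only view the agent gets: raw RGB pixels."""
        frame = [BLACK] * self._state.width
        frame[self._state.goal_x] = GREEN
        frame[self._state.player_x] = RED  # player is drawn on top
        return frame

    def step(self, action: str) -> None:
        dx = {"left": -1, "right": 1}.get(action, 0)
        new_x = self._state.player_x + dx
        self._state.player_x = max(0, min(self._state.width - 1, new_x))


def pixel_agent(frame: Frame) -> str:
    """Decides from pixels alone; it never touches HiddenState."""
    if GREEN not in frame:
        return "noop"
    return "right" if frame.index(GREEN) > frame.index(RED) else "left"


def memory_evaluator(game: ToyGame) -> bool:
    """Independent check on internal state, analogous to memory scanning."""
    state = game._state
    return state.player_x == state.goal_x


def run_episode(agent: Callable[[Frame], str], max_steps: int = 10) -> bool:
    game = ToyGame()
    for _ in range(max_steps):
        if memory_evaluator(game):
            return True
        game.step(agent(game.render()))
    return memory_evaluator(game)


print(run_episode(pixel_agent))
```

The design point is that `pixel_agent` receives only the output of `render()`, so no privileged state can leak into its decisions, while `memory_evaluator` needs no human judgment to decide success.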
