TL;DR
PokeGym is a visually-driven, long-horizon benchmark built inside the 3D open-world game Pokemon Legends: Z-A. It evaluates vision-language models (VLMs) on complex tasks that require autonomous exploration and visual reasoning from raw RGB observations alone.
Contribution
The benchmark combines strict visual-only operation, diverse tasks, and automated evaluation. Results on it expose key limitations of current VLMs and point to areas for architectural improvement.
Findings
Current VLMs struggle with physical deadlock recovery.
Weaker models often fail to recognize entrapment (Unaware Deadlocks).
Advanced models recognize deadlocks but cannot recover (Aware Deadlocks).
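The distinction between the two deadlock types can be made concrete with a small sketch. This is not the paper's detection procedure: the `Step` record, the `window` and `eps` thresholds, and the `says_stuck` flag (whether the model's own reasoning text acknowledged being stuck) are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    # Hypothetical per-step log entry; field names are assumptions.
    position: Tuple[float, float, float]  # evaluator-side player position
    says_stuck: bool                      # model's reasoning mentioned being stuck


def classify_deadlock(steps: List[Step], window: int = 5, eps: float = 0.1) -> Optional[str]:
    """Label the last `window` steps as 'unaware', 'aware', or None (no deadlock).

    Assumption: a deadlock is a run of steps in which the position never
    moves more than `eps` along any axis from where the run started.
    """
    if len(steps) < window:
        return None
    recent = steps[-window:]
    start = recent[0].position
    moved = any(
        max(abs(a - b) for a, b in zip(s.position, start)) > eps
        for s in recent[1:]
    )
    if moved:
        return None
    # Stuck in place: 'aware' if the model noticed at any point in the run.
    return "aware" if any(s.says_stuck for s in recent) else "unaware"
```

Under these assumptions, a model that keeps issuing movement commands against a wall without comment is labeled `unaware`, while one that reports being blocked yet never changes position is labeled `aware`.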
Abstract
While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment.…
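The separation between a pixels-only agent and a privileged evaluator can be illustrated with a toy example. Everything below is a hypothetical stand-in, not PokeGym's code: `ToyGame` replaces the real game, `AgentView` is the only handle the policy receives, and `Evaluator` reads internal state directly in place of memory scanning. In a single Python process this boundary is a convention only; real isolation would presumably need separate processes.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]  # rows of RGB pixels

RED: Pixel = (255, 0, 0)    # player marker in the toy renderer
GREEN: Pixel = (0, 255, 0)  # goal marker in the toy renderer
BLACK: Pixel = (0, 0, 0)


class ToyGame:
    """Hypothetical stand-in for the game: a 1-pixel-high corridor."""

    def __init__(self, width: int = 8, goal_x: int = 5) -> None:
        self.width = width
        self.player_x = 0
        self.goal_x = goal_x

    def step(self, action: str) -> None:
        dx = {"left": -1, "right": 1}.get(action, 0)
        self.player_x = min(self.width - 1, max(0, self.player_x + dx))

    def render(self) -> Frame:
        row = [BLACK] * self.width
        row[self.goal_x] = GREEN
        row[self.player_x] = RED  # player is drawn over the goal when they overlap
        return [row]


class AgentView:
    """Agent-facing interface: RGB frames in, actions out, no state access."""

    def __init__(self, game: ToyGame) -> None:
        self._render = game.render
        self._step = game.step

    def observe(self) -> Frame:
        return self._render()

    def act(self, action: str) -> None:
        self._step(action)


class Evaluator:
    """Evaluator-facing interface: reads internal state the agent never sees."""

    def __init__(self, game: ToyGame) -> None:
        self._game = game

    def success(self) -> bool:
        return self._game.player_x == self._game.goal_x


def pixel_policy(frame: Frame) -> str:
    """Decide from pixels alone: walk toward the green pixel."""
    row = frame[0]
    if GREEN not in row:
        return "noop"  # goal hidden under the player marker
    return "right" if row.index(GREEN) > row.index(RED) else "left"


def run_episode(policy: Callable[[Frame], str], max_steps: int = 20) -> bool:
    game = ToyGame()
    view, judge = AgentView(game), Evaluator(game)
    for _ in range(max_steps):
        if judge.success():
            return True
        view.act(policy(view.observe()))
    return judge.success()
```

The policy function receives only a `Frame`, so any success it achieves has to come from what is visible in the image, while the success check never depends on the agent's own report.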
