TL;DR
VideoGameBench evaluates vision-language models' ability to play classic video games using only visual inputs and high-level objectives, revealing current models' limited capabilities in real-time gameplay.
Contribution
Introduces VideoGameBench, a novel benchmark for assessing vision-language models on real-time video game tasks with minimal auxiliary information.
Findings
Models complete less than 1% of the games in the benchmark.
Inference latency significantly hampers model performance in real-time settings.
VideoGameBench Lite removes the effect of inference latency by pausing the game while the model runs inference.
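To make the latency finding concrete, the sketch below counts how many game frames go by unobserved while a model is producing its next action, with and without pausing. It is an illustrative toy, not the benchmark's code: the function name `simulate_episode`, the 60 FPS frame rate, and the 2-second latency are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    frames_elapsed: int     # total emulator frames that advanced
    frames_unobserved: int  # frames that advanced while the model was busy


def simulate_episode(num_actions: int, latency_s: float, fps: int,
                     pause_during_inference: bool) -> EpisodeStats:
    """Toy model of an agent loop (hypothetical, for illustration only).

    Each step: the model sees one frame, spends `latency_s` seconds
    producing an action, and the action is then applied for one frame.
    """
    frames_elapsed = 0
    frames_unobserved = 0
    for _ in range(num_actions):
        if not pause_during_inference:
            # Real-time setting: the game keeps running during inference.
            missed = round(latency_s * fps)
            frames_elapsed += missed
            frames_unobserved += missed
        # The chosen action is applied for a single frame.
        frames_elapsed += 1
    return EpisodeStats(frames_elapsed, frames_unobserved)


if __name__ == "__main__":
    real_time = simulate_episode(10, latency_s=2.0, fps=60,
                                 pause_during_inference=False)
    paused = simulate_episode(10, latency_s=2.0, fps=60,
                              pause_during_inference=True)
    print("real-time:", real_time)
    print("paused:   ", paused)
```

Under these assumed numbers, every action in the real-time setting is chosen from a frame that is already two seconds stale, while in the paused setting the model always acts on the current frame.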
Abstract
Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the…
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
See Summary
See Summary
- Challenging environments and a strict train/test split are much needed.
- The use of perceptual hashing to estimate game completion is interesting.
- Many similar benchmarks already exist, as the authors have cited.
- The evaluation scope is narrow: experiments are limited to a handful of VLMs, with no systematic scaling or modality ablation.
1. The paper addresses a critical, underexplored gap: evaluating VLMs on full, unmodified real-world tasks (1990s video games) that require integrated human-like abilities (perception, memory, real-time decision-making). Prior benchmarks rely on simplified grid worlds, text-only games, or game-specific tools (e.g., Gemini Plays Pokemon used pathfinding hints), making VideoGameBench a novel "no crutches" evaluation.
2. The benchmark construction is rigorous: it supports multiple emulators (PyBoy,
1. The benchmark focuses exclusively on 1990s 2D/3D games (Game Boy, MS-DOS), excluding modern game mechanics (e.g., open-world exploration, multiplayer, touch controls) and other classic platforms (e.g., NES, Sega Genesis). This narrows the generalizability of the results: VLMs may fail differently on games with distinct interaction paradigms (e.g., point-and-click adventures vs. real-time strategy).
2. There is no human baseline for the full benchmark, only confirmation that humans can complete pract
The benchmark is very cool, and the engineering and effort that must have gone into making it are surely impressive. The paper is well written and easy to follow.
One of my main criticisms of this paper is that a benchmark where all current models score 0% on basically every single task is not an interesting benchmark. This surely makes the benchmark future-proof, but the insight it can provide now, compared to existing benchmarks with more fine-grained progression systems, is lacking. Without a more fine-grained progression system, or easier games where current VLMs can achieve some level of performance, the benchmark can offer limited insights
