VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang; Thomas L. Griffiths; Karthik R. Narasimhan; Ofir Press

arXiv:2505.18134·cs.AI·May 18, 2026

TL;DR

VideoGameBench evaluates vision-language models' ability to play classic video games using only visual inputs and high-level objectives, revealing current models' limited capabilities in real-time gameplay.

Contribution

Introduces VideoGameBench, a novel benchmark for assessing vision-language models on real-time video game tasks with minimal auxiliary information.

Findings

1. Current models complete less than 1% of the benchmark.
2. Inference latency significantly hampers model performance in real-time settings.
3. VideoGameBench Lite removes the latency penalty by pausing the game during model inference.
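
The latency findings can be illustrated with a toy simulation. The sketch below is not the benchmark's code: `ToyEmulator`, `unobserved_frames`, the 30 fps frame rate, and the 2-second inference time are all assumptions chosen for illustration. It counts how many frames pass between the screenshot a model observes and the moment its action is applied, in a real-time setting and in a Lite-style setting where the game pauses during inference.

```python
from dataclasses import dataclass


@dataclass
class ToyEmulator:
    """Hypothetical stand-in for a game emulator; it only counts frames."""
    fps: int = 30
    frame: int = 0

    def advance(self, seconds: float) -> None:
        # Advance the game clock by the given wall-clock duration.
        self.frame += round(seconds * self.fps)


def unobserved_frames(steps: int, inference_seconds: float,
                      pause_during_inference: bool, fps: int = 30) -> int:
    """Total frames that pass between observing a screenshot and acting on it."""
    emu = ToyEmulator(fps=fps)
    stale = 0
    for _ in range(steps):
        observed_at = emu.frame  # the model sees this frame
        if not pause_during_inference:
            emu.advance(inference_seconds)  # real time: the game keeps running
        stale += emu.frame - observed_at  # how outdated the chosen action is
        emu.advance(1 / fps)  # apply the chosen action for one frame
    return stale


# Assumed numbers: 10 decisions, 2 s of inference each, 30 fps.
realtime = unobserved_frames(10, 2.0, pause_during_inference=False)
paused = unobserved_frames(10, 2.0, pause_during_inference=True)
print(realtime, paused)
```

Under these assumed numbers, every real-time action is chosen from a screenshot that is 60 frames old, while the paused setting always acts on the current frame.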

Abstract

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans, such as perception, spatial navigation, and memory management, remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the…

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 4·Confidence 3

Strengths

See Summary

Weaknesses

See Summary

Reviewer 02·Rating 2·Confidence 5

Strengths

- Challenging environments and a strict train/test split are much needed.
- The use of perceptual hashing to estimate game completion is interesting.
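
For readers unfamiliar with the perceptual hashing the reviewer mentions, here is a minimal sketch of the general idea, not the authors' implementation. It uses an average hash (one common perceptual hash) on grayscale frames stored as nested lists. The function names, the 8x8 hash size, and the `max_distance` threshold are assumptions for illustration, and the frame dimensions are assumed to be multiples of the hash size.

```python
def average_hash(frame, hash_size=8):
    """Average hash of a 2D grayscale frame (rows of 0-255 ints).

    Assumes the frame height and width are multiples of hash_size.
    """
    bh, bw = len(frame) // hash_size, len(frame[0]) // hash_size
    cells = []
    for by in range(hash_size):
        for bx in range(hash_size):
            # Mean brightness of one block of the downsampled grid.
            block = [frame[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for cell in cells:
        bits = (bits << 1) | int(cell > mean)  # 1 if brighter than average
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def reached_checkpoint(frame, checkpoint_hashes, max_distance=5):
    """Index of the first checkpoint whose hash is close to the frame, else None."""
    h = average_hash(frame)
    for i, ref in enumerate(checkpoint_hashes):
        if hamming(h, ref) <= max_distance:
            return i
    return None


# Toy 16x16 frames: a left/right split, a top/bottom split, a brighter copy.
left_right = [[200] * 8 + [20] * 8 for _ in range(16)]
top_bottom = [[200] * 16 for _ in range(8)] + [[20] * 16 for _ in range(8)]
brighter = [[p + 3 for p in row] for row in left_right]
```

A small brightness shift leaves the hash unchanged, so `brighter` still matches a checkpoint built from `left_right`, whereas `top_bottom` differs in half of the 64 bits and does not match.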

Weaknesses

- Many similar benchmarks already exist, as the authors have cited.
- The evaluation scope is narrow: experiments are limited to a handful of VLMs, with no systematic scaling or modality ablation.

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The paper addresses a critical underexplored gap: evaluating VLMs on full, unmodified real-world tasks (1990s video games) that require integrated human-like abilities (perception, memory, real-time decision-making). Prior benchmarks rely on simplified grid worlds, text-only games, or game-specific tools (e.g., Gemini Plays Pokemon used pathfinding hints), making VideoGameBench a novel "no crutches" evaluation.
2. The benchmark construction is rigorous: it supports multiple emulators (PyBoy,

Weaknesses

1. The benchmark focuses exclusively on 1990s 2D/3D games (Game Boy, MS-DOS), excluding modern game mechanics (e.g., open-world exploration, multiplayer, touch controls) or other classic platforms (e.g., NES, Sega Genesis). This narrows the generalizability of results: VLMs may fail differently on games with distinct interaction paradigms (e.g., point-and-click adventures vs. real-time strategy).
2. There is no human baseline for the full benchmark; only confirmation that humans can complete pract

Reviewer 04·Rating 2·Confidence 5

Strengths

The benchmark is very cool, and the engineering and effort that must have gone into making it are surely impressive. The paper is well written and easy to follow.

Weaknesses

One of my main criticisms of this paper is that a benchmark where all the current models score 0% on basically every single task is not an interesting benchmark. This surely makes the benchmark future-proof, but the amount of insight it can provide now, compared to existing benchmarks with more fine-grained progression systems, is lacking. Without a more fine-grained progression system, or easier games where current VLMs can get some amount of performance, the benchmark can offer limited insights

Code & Models

Repositories

alexzhang13/videogamebench
github
