lmgame-Bench: How Good are LLMs at Playing Games?
Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang

TL;DR
This paper introduces lmgame-Bench, a benchmark for evaluating large language models on platformer, puzzle, and narrative video games. It targets three obstacles to game-based evaluation (brittle vision perception, prompt sensitivity, and potential data contamination) and examines how well skills learned in games transfer.
Contribution
The paper presents lmgame-Bench, a standardized evaluation suite for LLMs in games. It pairs a unified Gym-style API with lightweight perception and memory scaffolds, is designed to stabilize prompt variance and remove data contamination, and supports transfer learning analysis.
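To make the idea of a Gym-style interface with perception and memory scaffolds concrete, here is a minimal sketch in plain Python. It is not the lmgame-Bench API: the toy game, `ToyCorridorEnv`, `perceive`, `Memory`, `run_episode`, and the stand-in policy are all hypothetical names invented for illustration, and only the `reset`/`step` return convention is borrowed from the common Gym style.

```python
from collections import deque


class ToyCorridorEnv:
    """Hypothetical 1-D game: walk from cell 0 to the goal in the last cell."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        # Actions are "left" or "right"; anything else is a no-op.
        if action == "right":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        self.steps += 1
        terminated = self.pos == self.length - 1
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self._obs(), reward, terminated, truncated, {}

    def _obs(self):
        return {"pos": self.pos, "length": self.length}


def perceive(obs):
    """Perception scaffold (assumed): render the raw state as text."""
    cells = ["A" if i == obs["pos"] else "." for i in range(obs["length"])]
    if obs["pos"] != obs["length"] - 1:
        cells[-1] = "G"  # mark the goal unless the agent stands on it
    return "".join(cells)


class Memory:
    """Memory scaffold (assumed): keep only the most recent transitions."""

    def __init__(self, capacity=3):
        self.entries = deque(maxlen=capacity)

    def add(self, text, action, reward):
        self.entries.append((text, action, reward))

    def as_prompt(self):
        return "\n".join(f"{t} -> {a} (reward {r})" for t, a, r in self.entries)


def run_episode(env, policy, memory):
    """Run one episode; `policy` stands in for an LLM call."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        text = perceive(obs)
        action = policy(text, memory.as_prompt())
        obs, reward, terminated, truncated, _ = env.step(action)
        memory.add(text, action, reward)
        total += reward
        done = terminated or truncated
    return total


def always_right(text, history):
    # Placeholder policy; a real agent would query a model here.
    return "right"
```

Because the scaffolds sit outside the environment, either one can be swapped or disabled without touching the game loop, which is the kind of separation the unified interface is meant to allow.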
Findings
Across 13 leading models, lmgame-Bench is challenging while still separating models well.
Models show transferability from single-game reinforcement learning.
Each game probes a unique blend of capabilities rather than a single isolated skill.
Abstract
Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of…
Peer Reviews
Decision: ICLR 2026 Poster
1. Modular Harness Design: Addresses a key limitation of prior game benchmarks (entangled skills) by enabling selective activation of perception, memory, and reasoning modules. This allows fine-grained diagnosis of model strengths/weaknesses (e.g., separating perception failures from planning gaps) that was previously unachievable.
2. Rigorous Experimental Design: Evaluates 13 models across 6 diverse games (platformer, puzzle, narrative) with standardized metrics (progression/long-horizon reward
1. Limited Game Diversity: While the 6 games cover 3 genres, they lack representation of real-time strategy (RTS), open-world, or multiplayer games, domains that test collaboration, dynamic resource management, or complex opponent adaptation. This limits the benchmark’s generalizability to broader game-based agentic tasks.
2. Computational Cost Opacity: While the paper mentions high computational costs (Appendix B.4), it does not provide concrete guidance for scaling evaluations (e.g., cost-savin
1. A new benchmark consisting of complex goal-driven games is introduced in this study.
2. An extensive suite of models is evaluated, covering 13 state-of-the-art architectures.
3. The problem statement and the experimental framework are well designed and presented.
4. The authors perform detailed and consistent evaluations across difficulty levels, revealing how and where models fail.
While the work is interesting and systematically executed, many of its findings align with prior studies that have already established similar limitations of LLMs and explored methods to overcome them (e.g., Chain-of-Thought reasoning, embedding API calls, or memory modules/database access). The novelty and contribution of this work, therefore, feel limited unless the authors can better justify what new insights their benchmark offers. Additionally, while the authors show that adding different
Taxonomy
Topics: Artificial Intelligence in Games · Multi-Agent Systems and Negotiation · Digital Rights Management and Security
