GameArena: Evaluating LLM Reasoning through Live Computer Games
Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, Hao Zhang

TL;DR
GameArena is a novel interactive benchmark that evaluates large language models' reasoning skills through engaging, real-time gameplay, providing detailed insights into their reasoning processes in dynamic, human-in-the-loop settings.
Contribution
It introduces a new dynamic, interactive benchmark with specific reasoning tasks, enabling detailed analysis of LLM reasoning in real-world gameplay scenarios.
Findings
Collected over 2000 game sessions for analysis.
Assessed reasoning capabilities of five state-of-the-art LLMs.
Improved user engagement compared to existing benchmarks.
Abstract
Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing benchmarks often depend on static datasets, which are vulnerable to data contamination and may get saturated over time, or on binary live human feedback that conflates reasoning with other abilities. As the most prominent dynamic benchmark, Chatbot Arena evaluates open-ended questions in real-world settings, but lacks the granularity in assessing specific reasoning capabilities. We introduce GameArena, a dynamic benchmark designed to evaluate LLM reasoning capabilities through interactive gameplay with humans. GameArena consists of three games designed to test specific reasoning capabilities (e.g., deductive and inductive reasoning), while keeping participants entertained and engaged. We analyze the gaming data retrospectively to uncover the underlying reasoning processes of LLMs and measure…
Peer Reviews
Decision: ICLR 2025 Poster
The paper is well-written and fairly clear. It presents an interesting idea to bypass data contamination issues when evaluating LLM capabilities by leveraging text-based games and the sequence of response rounds during their play, to try to evaluate the reasoning abilities of a given model in different dimensions. The proposed approach also emphasizes that models output their "rationale" for each game round response, and these are then used to calculate specific metrics that map to the reaso
I really like the intended approach and the motivation for the proposed analysis; however, I see a couple of weaknesses. Firstly, the analysis of reasoning capabilities depends on a "replay" of a concluded game session round by round. At each round the replay prompt asks the models to output their "intermediary thought process". As LLMs are known for hallucination and fabricating rationalizations, this could heavily affect the proposed approach. There is no guarantee that whatever "evidence" being
Within 10 pages of content, the paper manages to include abundant information, including a detailed description of the proposed tasks, comprehensive experimental settings, and multi-aspect analysis across five SOTA models. The authors include three separate games, each evaluating one specific reasoning skill of the tested LLM. The results and conclusions are solid and well organized. I believe such research could highly benefit the academic society as a reasoning benchma
- In lines 82 to 84, the authors state the effectiveness of data collection solely based on reasoning assessment. The comparison may not be fair since Chatbot Arena aims to evaluate human preference across a gigantic number of tasks and each pair-wise comparison contributes to the ranking.
- The description of the Taboo game in Figure 2 is confusing. The target of "utter the target word unconsciously" is ambiguous. An example of human win would be better to demonstrate the expected response fit
+ Carefully designed metrics.
+ Gamified framework boosts human interest and willingness to participate in the evaluations.
+ Extensive experiments covering 5 models, both open-source and commercial.
- The motivation, especially how using this benchmark can have findings different from / aligning with other benchmarks, such as Chatbot Arena, GameBench, or GTBench, needs further explanation. How can the findings generalize to other games / downstream tasks?
- Results need deeper analysis (question 3, 6).
- Lack of detailed game statistics, for example, the distribution of target words, etc. (question 4). This could be some potential biases for models.
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Artificial Intelligence in Law · Software Engineering Research
