GAMEBoT: Transparent Assessment of LLM Reasoning in Games

Wenye Lin; Jonathan Roberts; Yunhan Yang; Samuel Albanie; Zongqing Lu; Kai Han

arXiv:2412.13602 · cs.CL · June 3, 2025

TL;DR

GAMEBoT introduces a transparent, modular gaming benchmark for evaluating LLM reasoning, enabling detailed assessment of intermediate steps and reducing data contamination, thus providing a more rigorous measure of LLM capabilities.

Contribution

This paper presents GAMEBoT, a novel gaming benchmark that decomposes complex reasoning into subproblems, uses rule-based ground truth, and facilitates transparent, rigorous LLM evaluation.

Findings

01. Benchmark challenges LLM reasoning abilities.
02. LLMs struggle even with detailed CoT prompts.
03. GAMEBoT reduces data contamination risks.

Abstract

Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, robust benchmarks are required to evaluate their capabilities beyond superficial pattern recognition. However, current LLM reasoning benchmarks often face challenges such as insufficient interpretability, performance saturation or data contamination. To address these challenges, we introduce GAMEBoT, a gaming arena designed for rigorous and transparent assessment of LLM reasoning capabilities. GAMEBoT decomposes complex reasoning in games into predefined modular subproblems. This decomposition allows us to design a suite of Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs in addressing these subproblems before action selection. Furthermore, we develop a suite of rule-based algorithms to generate ground truth for these subproblems,…
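The idea of scoring an intermediate subproblem against rule-based ground truth can be sketched as follows. This is an illustration only, not the GAMEBoT implementation: tic-tac-toe is used purely as a convenient example game, the subproblem ("which moves win immediately?") is hypothetical, and the function names `winning_moves` and `score_intermediate` as well as the choice of F1 as the score are assumptions made for this sketch.

```python
from typing import List, Set, Tuple

Board = List[List[str]]  # 3x3 grid holding "X", "O", or "" for an empty cell
Move = Tuple[int, int]   # (row, column)

# All eight lines that win a tic-tac-toe game: rows, columns, diagonals.
LINES: List[List[Move]] = (
    [[(r, 0), (r, 1), (r, 2)] for r in range(3)]
    + [[(0, c), (1, c), (2, c)] for c in range(3)]
    + [[(0, 0), (1, 1), (2, 2)], [(0, 2), (1, 1), (2, 0)]]
)


def winning_moves(board: Board, player: str) -> Set[Move]:
    """Rule-based ground truth for a hypothetical subproblem:
    every empty cell that completes a line for `player`."""
    moves: Set[Move] = set()
    for line in LINES:
        marks = [board[r][c] for r, c in line]
        if marks.count(player) == 2 and marks.count("") == 1:
            moves.add(line[marks.index("")])
    return moves


def score_intermediate(predicted: Set[Move], truth: Set[Move]) -> float:
    """F1 between the model's intermediate answer and the ground truth
    (the metric is an assumption of this sketch)."""
    if not predicted and not truth:
        return 1.0  # both agree that there is nothing to report
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Usage: X can win at (0, 2); the hypothetical model answer adds a wrong cell.
board: Board = [
    ["X", "X", ""],
    ["O", "O", ""],
    ["", "", ""],
]
truth = winning_moves(board, "X")
model_answer: Set[Move] = {(0, 2), (2, 2)}  # made-up parsed LLM output
score = score_intermediate(model_answer, truth)
```

Because the ground truth is computed from the game rules rather than stored, each intermediate answer can be checked on freshly generated game states, which is the property the abstract associates with transparent evaluation.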

Code & Models

Repositories

Visual-AI/GAMEBoT (Official)

Taxonomy

Topics: Artificial Intelligence in Law · Software Engineering Research · Open Source Software Innovations