GAMEBoT: Transparent Assessment of LLM Reasoning in Games
Wenye Lin, Jonathan Roberts, Yunhan Yang, Samuel Albanie, Zongqing Lu, Kai Han

TL;DR
GAMEBoT is a transparent, modular gaming benchmark for evaluating LLM reasoning. It enables detailed assessment of intermediate reasoning steps and reduces the risk of data contamination, providing a more rigorous measure of LLM capabilities.
Contribution
This paper presents GAMEBoT, a novel gaming benchmark that decomposes complex reasoning into subproblems, uses rule-based ground truth, and facilitates transparent, rigorous LLM evaluation.
Findings
The benchmark challenges the reasoning abilities of LLMs.
LLMs struggle even with detailed CoT prompts.
GAMEBoT reduces data contamination risks.
Abstract
Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, robust benchmarks are required to evaluate their capabilities beyond superficial pattern recognition. However, current LLM reasoning benchmarks often face challenges such as insufficient interpretability, performance saturation or data contamination. To address these challenges, we introduce GAMEBoT, a gaming arena designed for rigorous and transparent assessment of LLM reasoning capabilities. GAMEBoT decomposes complex reasoning in games into predefined modular subproblems. This decomposition allows us to design a suite of Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs in addressing these subproblems before action selection. Furthermore, we develop a suite of rule-based algorithms to generate ground truth for these subproblems,…
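To make the idea of checking intermediate reasoning against rule-based ground truth concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the tic-tac-toe subproblem, the `[Winning moves: ...]` output tag, the F1 scoring rule, and every function name below are hypothetical choices made for this example.

```python
import re

# All eight three-in-a-row lines on a 3x3 board whose cells are indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winning_moves(board, player):
    """Rule-based ground truth (hypothetical subproblem):
    the empty cells that would complete a line for `player`.
    `board` is a 9-character string of 'X', 'O', or '.' (empty)."""
    moves = set()
    for line in LINES:
        marks = [board[i] for i in line]
        if marks.count(player) == 2 and marks.count(".") == 1:
            moves.add(line[marks.index(".")])
    return moves


def parse_intermediate(response):
    """Extract the model's subproblem answer from an assumed tag of the
    form '[Winning moves: 2, 7]'. Returns an empty set if the tag is absent."""
    match = re.search(r"\[Winning moves:([^\]]*)\]", response)
    if match is None:
        return set()
    return {int(tok) for tok in re.findall(r"\d+", match.group(1))}


def score_subproblem(board, player, response):
    """Compare the parsed intermediate answer with the ground truth.
    F1 is used here purely as an illustrative scoring choice."""
    truth = winning_moves(board, player)
    predicted = parse_intermediate(response)
    if not truth and not predicted:
        return 1.0  # correctly reported that no winning move exists
    true_positives = len(truth & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    board = "XX.OO...."  # X holds cells 0 and 1; O holds cells 3 and 4
    reply = "X can finish the top row. [Winning moves: 2] I play 2."
    print(score_subproblem(board, "X", reply))
```

Because the ground truth is computed from the game rules rather than stored as a fixed answer key, the intermediate step can be graded separately from the final action for any board state that arises during play.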
Taxonomy
Topics: Artificial Intelligence in Law · Software Engineering Research · Open Source Software Innovations
