GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, Arjun Yadav

TL;DR
GameBench is a new benchmark for assessing the strategic reasoning abilities of large language model agents across diverse game environments, highlighting current limitations and potential improvements.
Contribution
Introduces GameBench, a comprehensive cross-domain benchmark for evaluating strategic reasoning in LLM agents, with analysis of GPT-3 and GPT-4 performance using different prompting frameworks.
Findings
Models do not match human strategic reasoning performance.
GPT-4 performs worse than random in some scenarios.
CoT and RAP scaffolding improve reasoning scores, but the scaffolded models still fall below human levels.
Abstract
Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there is no comprehensive framework for evaluating agents' performance across the various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Auction Theory and Applications · Semantic Web and Ontologies
Methods: Attention Is All You Need · Cosine Annealing · Softmax · Balanced Selection · Focus · Layer Normalization · Weight Decay · Linear Warmup With Cosine Annealing
