GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Anthony Costarelli; Mat Allen; Roman Hauksson; Grace Sodunke; Suhas Hariharan; Carlson Cheng; Wenjie Li; Joshua Clymer; Arjun Yadav

arXiv:2406.06613 · cs.CL · July 23, 2024 · 2 cites

TL;DR

GameBench is a new benchmark for assessing the strategic reasoning abilities of large language model agents across diverse game environments, highlighting current limitations and potential improvements.

Contribution

Introduces GameBench, a comprehensive cross-domain benchmark for evaluating strategic reasoning in LLM agents, with analysis of GPT-3 and GPT-4 performance using different prompting frameworks.

Findings

01. Models do not match human strategic reasoning performance.

02. GPT-4 performs worse than random in some scenarios.

03. Chain-of-Thought (CoT) and Reasoning-via-Planning (RAP) scaffolding improve scores, but the scaffolded models remain below human levels.

Abstract

Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there is no comprehensive framework for evaluating agents' performance across the various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting…

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Auction Theory and Applications · Semantic Web and Ontologies

Methods: Attention Is All You Need · Cosine Annealing · Softmax · Balanced Selection · Focus · Layer Normalization · Weight Decay · Linear Warmup With Cosine Annealing