GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Jinhao Duan; Renming Zhang; James Diffenderfer; Bhavya Kailkhura; Lichao Sun; Elias Stengel-Eskin; Mohit Bansal; Tianlong Chen; Kaidi Xu

arXiv:2402.12348·cs.CL·June 11, 2024·3 cites

Open Access · 2 Repos

TL;DR

This paper introduces GTBench, a game-theoretic evaluation environment for LLMs, revealing their strategic reasoning strengths and weaknesses across various game scenarios and comparing open-source and commercial models.

Contribution

It proposes a comprehensive game-theoretic benchmarking framework for LLMs, analyzing their reasoning abilities and behaviors in diverse strategic scenarios.

Findings

01. LLMs struggle in complete and deterministic games

02. Open-source LLMs are less competitive than commercial ones in complex games

03. Code pretraining enhances strategic reasoning, but advanced reasoning methods have mixed effects

Abstract

As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we (1) characterize the game-theoretic reasoning of LLMs; and (2) perform LLM-vs.-LLM competitions as reasoning evaluation. We observe that (1) LLMs have distinct behaviors regarding various gaming scenarios; for example, LLMs fail in complete and deterministic games yet they are…
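The three taxonomy axes named in the abstract can be pictured as a small data structure. The sketch below is illustrative only: the class names, the `select` helper, and the two example entries are assumptions made for this page, not GTBench's actual code or its task list.

```python
from dataclasses import dataclass
from enum import Enum


class Information(Enum):
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"


class Timing(Enum):
    DYNAMIC = "dynamic"   # players move in turns
    STATIC = "static"     # players move simultaneously


class Chance(Enum):
    DETERMINISTIC = "deterministic"
    PROBABILISTIC = "probabilistic"


@dataclass(frozen=True)
class GameProfile:
    """Position of one game along the three taxonomy axes."""
    name: str
    information: Information
    timing: Timing
    chance: Chance


# Hypothetical entries chosen for illustration; not taken from the benchmark.
EXAMPLE_GAMES = [
    GameProfile("tic_tac_toe", Information.COMPLETE, Timing.DYNAMIC, Chance.DETERMINISTIC),
    GameProfile("kuhn_poker", Information.INCOMPLETE, Timing.DYNAMIC, Chance.PROBABILISTIC),
]


def select(games, **axes):
    """Return names of games whose profile matches every given axis value."""
    return [g.name for g in games
            if all(getattr(g, axis) == value for axis, value in axes.items())]


# The slice the abstract singles out: complete-information, deterministic games.
complete_deterministic = select(
    EXAMPLE_GAMES,
    information=Information.COMPLETE,
    chance=Chance.DETERMINISTIC,
)
```

Grouping results by these axes is one way to report per-category behavior, such as the complete and deterministic slice mentioned in the findings.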

Taxonomy

Topics: Auction Theory and Applications

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Dropout · Residual Connection