TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Haochuan Wang; Xiachong Feng; Lei Li; Yu Guo; Zhanyue Qin; Dianbo Sui; Lingpeng Kong

arXiv:2410.10479·cs.AI·May 28, 2025

TL;DR

TMGBench is a comprehensive, extensible game benchmark designed to evaluate the strategic reasoning abilities of large language models across diverse game types and complex structures.

Contribution

The paper introduces TMGBench, a new benchmark covering all 144 2x2 game types with diverse scenarios and complex organization, addressing limitations of previous benchmarks.
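To illustrate where the figure of 144 comes from, here is a minimal sketch, not the authors' code. It assumes strict ordinal payoffs (each player ranks the four outcomes 1 to 4) and treats two games as the same type when they differ only by swapping rows and/or swapping columns, which is the usual counting behind the Robinson-Goforth topology named in the abstract. The function names `canonical`, `enumerate_games`, and `pure_nash` are hypothetical and chosen only for this example.

```python
from itertools import permutations


def canonical(game):
    """Return a canonical representative of a game under row/column swaps.

    A game is a tuple of four (row_payoff, col_payoff) cells in the order
    (top-left, top-right, bottom-left, bottom-right).
    """
    tl, tr, bl, br = game
    variants = [
        (tl, tr, bl, br),  # identity
        (bl, br, tl, tr),  # swap rows
        (tr, tl, br, bl),  # swap columns
        (br, bl, tr, tl),  # swap rows and columns
    ]
    return min(variants)


def enumerate_games():
    """Enumerate strict ordinal 2x2 games up to row/column relabeling."""
    ranks = (1, 2, 3, 4)
    seen = set()
    for row_payoffs in permutations(ranks):
        for col_payoffs in permutations(ranks):
            game = tuple(zip(row_payoffs, col_payoffs))
            seen.add(canonical(game))
    return sorted(seen)


def pure_nash(game):
    """Return the (row, col) indices of pure-strategy Nash equilibria."""
    tl, tr, bl, br = game
    cells = {(0, 0): tl, (0, 1): tr, (1, 0): bl, (1, 1): br}
    equilibria = []
    for (r, c), (u_row, u_col) in cells.items():
        row_stays = u_row > cells[(1 - r, c)][0]  # row player cannot gain by switching rows
        col_stays = u_col > cells[(r, 1 - c)][1]  # column player cannot gain by switching columns
        if row_stays and col_stays:
            equilibria.append((r, c))
    return equilibria


games = enumerate_games()
print(len(games))  # 144

# Ordinal Prisoner's Dilemma: row/column 0 = cooperate, 1 = defect.
prisoners_dilemma = ((3, 3), (1, 4), (4, 1), (2, 2))
print(pure_nash(prisoners_dilemma))  # [(1, 1)]
```

There are 4! × 4! = 576 ways to assign the two rankings, and each game has exactly four distinct row/column relabelings, so 576 / 4 = 144 types remain. How TMGBench itself encodes or indexes these games is not shown here.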

Findings

01

LLMs show flaws in strategic reasoning accuracy and consistency

02

State-of-the-art models vary in Theory-of-Mind capabilities

03

Complex game structures pose significant challenges for LLMs

Abstract

The rapid advancement of large language models has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate the strategic reasoning capabilities of LLMs, game theory, with its concise structure, has become the preferred approach for many researchers. However, current research typically focuses on a limited selection of games, resulting in low coverage of game types. Additionally, classic game scenarios carry risks of data leakage, and the benchmarks used often lack extensibility, rendering them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, characterized by comprehensive game type coverage, diverse scenarios and flexible game organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games in our…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- The paper is well written and well organized.
- The games included in TMGBENCH are comprehensive.

Weaknesses

- I am not fully convinced that there is a need for a benchmark for evaluating the strategic reasoning abilities of LLMs. In fact, a universal definition of strategic reasoning ability is lacking. In other words, what are the fundamental differences between tasks that require strategic reasoning and tasks that do not?
- If there is a clear definition of strategic reasoning, I would expect a more systematic study of existing LLMs on strategic reasoning. Why some LLMs perform better than ot

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- Models are tested rigorously; 2,880 times for a single model in the single game tests, the complex games have a baseline of being tested 20 times, and there's testing for positional bias with the reFoToM / reSoToM prompts.
- Extensibility: this is a great way of creating a difficult-to-overfit-to benchmark, using the synthetic data generated stories as additional "games" to play.
- The metrics used (ID, BD, PAR) are comprehensive for evaluating a model's performance and good insight to how the

Weaknesses

- The paper can be hard to follow at times. It would be nice to have examples of the complex games to solidify the reader's understanding. The description given for sequential games doesn't quite make sense to me, even with two introductions. And because of that, I'm not sure how well it upholds the task of "testing for strategic reasoning".
- I'm not convinced that parallel forms are actually a test of strategic reasoning either, this seems closer to measuring the model's "working memory" and b

Reviewer 03 · Rating 8 · Confidence 2

Strengths

- The paper is very well-written.
- Objectives are clear, and how those objectives are achieved by this work is well demonstrated.
- Quantified metrics and visualisations have been used to compare LLMs on different tasks to assess their capabilities.
- Extensive experiments were conducted to examine the failure cases and the effect of ToM.
- Limitations were also discussed.
- The generation pipeline was demonstrated in the Appendix.

Overall, the reviewer quite enjoyed reading this paper.

Weaknesses

No particular weakness was identified by the reviewer. The reviewer is not an expert in game theory or reasoning. It is quite likely that the reviewer is unfamiliar with some pieces of related work or crucial parts of this work.

Code & Models

Repositories

pinkex/tmgbench · Official

Taxonomy

Topics: Open Education and E-Learning