GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps

Muhammad Umair Nasir; Steven James; Julian Togelius

arXiv:2410.07765·cs.CL·October 11, 2024


TL;DR

This paper introduces GameTraversalBenchmark (GTB), a new benchmark to evaluate the planning abilities of large language models using 2D game maps, revealing current models' limited performance and potential for improvement.

Contribution

The paper presents GTB, a novel benchmark for assessing LLMs' planning skills in 2D grid-based games, and evaluates several models, highlighting their strengths and limitations.

Findings

01

GPT-4-Turbo scored 44.97% on GTB_Score (GTBS)

02

In preliminary tests, the large reasoning model o1 scored 67.84% on GTBS

03

Current models find the benchmark challenging

Abstract

Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. While they have also shown potential beyond the domain of natural language, it remains an open question to what extent and in which ways these LLMs can plan. We investigate their planning capabilities by proposing GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. An LLM succeeds if it can traverse through the given objectives with a minimum number of steps and a minimum number of generation errors. We evaluate a number of LLMs on GTB and find that GPT-4-Turbo achieves the highest score of 44.97% on GTB_Score (GTBS), a composite score that combines the three criteria above. Furthermore, we preliminarily test large reasoning models, namely o1, which scores 67.84% on GTBS, indicating that the benchmark remains challenging for…
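To make the traversal task concrete, here is a minimal Python sketch of how one might check a proposed move sequence on a 2D grid map against the shortest possible path. It is an illustration only, not the benchmark's code: the tile characters (`#` for walls, `.` for floor), the `U`/`D`/`L`/`R` action letters, and the function names `shortest_path_length` and `replay_moves` are all assumptions, and the count of invalid moves is a simple stand-in rather than the paper's definition of generation errors. The GTBS formula itself is not given in this summary and is not reproduced here.

```python
from collections import deque

# Hypothetical action encoding: (row delta, column delta) per move letter.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def is_free(grid, r, c, walls="#"):
    """True if (r, c) is inside the map and not a wall tile."""
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] not in walls


def shortest_path_length(grid, start, goal):
    """Breadth-first search: fewest 4-directional moves, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            if is_free(grid, *nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def replay_moves(grid, start, moves):
    """Apply a move string; return (final position, valid steps, invalid moves).

    Unknown letters and moves into walls or off the map count as invalid
    and leave the position unchanged (an assumption of this sketch).
    """
    r, c = start
    steps = invalid = 0
    for token in moves:
        delta = MOVES.get(token)
        if delta is None or not is_free(grid, r + delta[0], c + delta[1]):
            invalid += 1
            continue
        r, c = r + delta[0], c + delta[1]
        steps += 1
    return (r, c), steps, invalid


GRID = [
    "....",
    ".##.",
    "....",
]
```

With this toy map, `shortest_path_length(GRID, (0, 0), (2, 3))` gives the reference number of steps, and `replay_moves(GRID, (0, 0), "RRRDD")` shows whether a candidate plan reaches the objective and how many of its moves were invalid. Comparing the two is one simple way to see how far a plan is from the minimum.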


Code & Models

Repositories

umair-nasir14/game-traversal-benchmark
Official

Videos

GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps · SlidesLive

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques · Topic Modeling