GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps
Muhammad Umair Nasir, Steven James, Julian Togelius

TL;DR
This paper introduces GameTraversalBenchmark (GTB), a new benchmark to evaluate the planning abilities of large language models using 2D game maps, revealing current models' limited performance and potential for improvement.
Contribution
The paper presents GTB, a novel benchmark for assessing LLMs' planning skills in 2D grid-based games, and evaluates several models, highlighting their strengths and limitations.
Findings
GPT-4-Turbo scored 44.97% on GTB_Score (GTBS)
The large reasoning model o1 scored 67.84% on GTBS in preliminary tests
Current models find the benchmark challenging
Abstract
Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. While they have also shown potential beyond the domain of natural language, it remains an open question to what extent and in which ways these LLMs can plan. We investigate their planning capabilities by proposing GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. An LLM succeeds if it can traverse through given objectives with a minimum number of steps and a minimum number of generation errors. We evaluate a number of LLMs on GTB and find that GPT-4-Turbo achieves the highest score of 44.97% on GTB_Score (GTBS), a composite score that combines the three criteria above. Furthermore, we preliminarily test large reasoning models, namely o1, which scores 67.84% on GTBS, indicating that the benchmark remains challenging for…
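To make the three criteria in the abstract concrete (reaching the objective, step count, and generation errors), here is a minimal Python sketch of how a move sequence could be checked against a 2D grid map. It is an illustration only, not the paper's evaluation code or the GTBS formula, which this summary does not give. The grid encoding ("." walkable, "#" wall), the move names, and the functions shortest_path_length and replay_moves are all hypothetical.

```python
from collections import deque

# Hypothetical map encoding: "." is walkable, "#" is a wall.
GRID = [
    "..#.",
    ".##.",
    "....",
]

# Hypothetical move vocabulary, as (row delta, column delta).
DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_walkable(grid, r, c):
    """True if (r, c) lies inside the grid and is not a wall."""
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == "."


def shortest_path_length(grid, start, goal):
    """Breadth-first search: minimum number of moves from start to goal,
    or None if the goal cannot be reached."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in DELTAS.values():
            nxt = (r + dr, c + dc)
            if nxt not in seen and is_walkable(grid, *nxt):
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def replay_moves(grid, start, moves):
    """Replay a generated move list. Returns (final position, valid steps
    taken, error count). Unknown move tokens and moves into a wall or off
    the map count as errors and leave the position unchanged."""
    r, c = start
    steps = errors = 0
    for move in moves:
        if move not in DELTAS:
            errors += 1
            continue
        dr, dc = DELTAS[move]
        if is_walkable(grid, r + dr, c + dc):
            r, c = r + dr, c + dc
            steps += 1
        else:
            errors += 1
    return (r, c), steps, errors


if __name__ == "__main__":
    start, goal = (0, 0), (0, 3)
    generated = ["down", "down", "right", "right", "right", "up", "up"]
    final, steps, errors = replay_moves(GRID, start, generated)
    print("reached goal:", final == goal)
    print("steps taken:", steps, "optimal:", shortest_path_length(GRID, start, goal))
    print("errors:", errors)
```

Comparing the replayed step count against the breadth-first optimum gives one way to express "minimum number of steps". How GTB actually weights path length and errors into a single score is defined in the paper itself.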
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques · Topic Modeling
