A LLM Benchmark based on the Minecraft Builder Dialog Agent Task
Chris Madge, Massimo Poesio

TL;DR
This paper introduces a new benchmark based on the Minecraft builder task to evaluate large language models' spatial reasoning and building capabilities, using synthetic tasks to identify strengths and weaknesses.
Contribution
It adapts the Minecraft builder task into a comprehensive synthetic benchmark for assessing LLMs in spatial reasoning and builder agent design.
Findings
Benchmark effectively tests spatial reasoning in LLMs.
Synthetic tasks reveal specific strengths and weaknesses.
Supports development of better builder agents.
Abstract
In this work we propose adapting the Minecraft builder task into an LLM benchmark suitable for evaluating LLM ability in spatially oriented tasks and for informing builder agent design. Previous works have proposed corpora with structures of varying complexity and human-written instructions. We instead attempt to provide a comprehensive synthetic benchmark for testing builder agents over a series of distinct tasks that comprise common building operations. We believe this approach allows us to probe specific strengths and weaknesses of different agents, and to test the ability of LLMs in the challenging area of spatial reasoning and vector-based math.
Taxonomy
Topics: Service-Oriented Architecture and Web Services · Multi-Agent Systems and Negotiation · Semantic Web and Ontologies
