A LLM Benchmark based on the Minecraft Builder Dialog Agent Task

Chris Madge; Massimo Poesio

arXiv:2407.12734·cs.CL·July 18, 2024

Open Access

TL;DR

This paper introduces a new benchmark based on the Minecraft builder task to evaluate large language models' spatial reasoning and building capabilities, using synthetic tasks to identify strengths and weaknesses.

Contribution

It adapts the Minecraft builder task into a comprehensive synthetic benchmark for assessing LLMs in spatial reasoning and builder agent design.

Findings

01 Benchmark effectively tests spatial reasoning in LLMs.

02 Synthetic tasks reveal specific strengths and weaknesses.

03 Supports development of better builder agents.

Abstract

In this work we propose adapting the Minecraft builder task into an LLM benchmark suitable for evaluating LLM ability in spatially oriented tasks and for informing builder agent design. Previous works have proposed corpora with structures of varying complexity and human-written instructions. We instead attempt to provide a comprehensive synthetic benchmark for testing builder agents over a series of distinct tasks that comprise common building operations. We believe this approach allows us to probe specific strengths and weaknesses of different agents, and to test the ability of LLMs in the challenging area of spatial reasoning and vector-based math.
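To make the idea of a synthetic building task concrete, the following is a minimal Python sketch of how one such task might be generated and scored. It is an illustration only, not the benchmark's actual format: the "row of blocks" task, the direction names, the coordinate convention, the colour list, and the F1-style `score` function are all assumptions introduced here.

```python
import random
from typing import NamedTuple


class Block(NamedTuple):
    x: int
    y: int
    z: int
    colour: str


# Hypothetical axis-aligned unit vectors; the paper's own convention may differ.
DIRECTIONS = {
    "east": (1, 0, 0),
    "west": (-1, 0, 0),
    "up": (0, 1, 0),
    "south": (0, 0, 1),
    "north": (0, 0, -1),
}


def make_row_task(rng):
    """Return (instruction, expected_blocks) for one synthetic 'row' task."""
    start = (rng.randint(0, 5), rng.randint(0, 5), rng.randint(0, 5))
    direction = rng.choice(sorted(DIRECTIONS))
    length = rng.randint(2, 4)
    colour = rng.choice(["red", "blue", "green"])
    dx, dy, dz = DIRECTIONS[direction]
    # Vector arithmetic: the i-th block sits at start + i * direction.
    expected = {
        Block(start[0] + i * dx, start[1] + i * dy, start[2] + i * dz, colour)
        for i in range(length)
    }
    instruction = (
        f"Place a row of {length} {colour} blocks starting at {start} "
        f"and heading {direction}."
    )
    return instruction, expected


def score(predicted, expected):
    """F1 over exact block matches (an assumed metric, chosen for illustration)."""
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)
```

In use, the instruction string would be sent to the agent under test, its reply parsed into a set of `Block` values, and that set passed to `score` together with the expected set. Because the ground truth is generated programmatically, each task type can be evaluated in isolation.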

Taxonomy

Topics: Service-Oriented Architecture and Web Services · Multi-Agent Systems and Negotiation · Semantic Web and Ontologies