Plancraft: an evaluation dataset for planning with LLM agents
Gautier Dagan, Frank Keller, Alex Lascarides

TL;DR
Plancraft is a multi-modal dataset for evaluating LLM agents' planning, tool use, and decision-making in Minecraft. It exposes the current limitations of these agents and points to ways of improving them.
Contribution
This paper introduces Plancraft, a novel dataset with multi-modal interfaces and challenging tasks for benchmarking LLM agent planning and reasoning capabilities.
Findings
LLMs and VLMs struggle with complex planning tasks in Plancraft.
Benchmarking reveals performance gaps between open-source and closed-source models.
The dataset enables analysis of tool use and solvability assessment in LLM agents.
Abstract
We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as a handcrafted planner and Oracle Retriever, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and compare their performance and efficiency to a handcrafted planner. Overall, we find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and offer suggestions on how to improve their…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Simulation Techniques and Applications · Semantic Web and Ontologies
Methods: Attention Is All You Need · Byte Pair Encoding · Attention Dropout · Linear Layer · Softmax · Dense Connections · Linear Warmup With Linear Decay · Dropout · WordPiece
