MCU: An Evaluation Framework for Open-Ended Game Agents
Xinyue Zheng, Haowei Lin, Kaichen He, Zihao Wang, Zilong Zheng, Yitao Liang

TL;DR
This paper introduces MCU, a comprehensive evaluation framework within Minecraft that assesses open-ended AI agents across thousands of diverse, composable tasks, revealing current agents' struggles with complexity and diversity.
Contribution
The paper presents MCU, a scalable, diverse, and human-aligned evaluation framework for open-ended game agents in Minecraft, addressing limitations of existing benchmarks.
Findings
State-of-the-art agents struggle with complex, diverse tasks.
MCU's evaluation framework achieves 91.5% alignment with human ratings.
The framework enables infinite task generation with varying difficulty.
Abstract
Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, and current benchmarks face scalability limitations. To address this, we introduce Minecraft Universe (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompass 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating an infinite variety of tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and…
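To give a rough sense of why composing atomic tasks yields a practically unbounded task space, the sketch below counts candidate compositions under a deliberately simple, hypothetical model: a composite task is an ordered chain of k atomic tasks drawn with repetition from the 3,452 reported atomic tasks. This chaining rule and the helper name `num_chains` are assumptions for illustration only, not MCU's actual composition mechanism.

```python
# Hypothetical counting model: a composite task is an ordered chain of k
# atomic tasks, drawn with repetition. This is an illustrative assumption,
# not MCU's actual composition rule.
NUM_ATOMIC_TASKS = 3452  # atomic-task count reported in the abstract


def num_chains(k: int, n: int = NUM_ATOMIC_TASKS) -> int:
    """Return the number of ordered length-k sequences over n atomic tasks."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return n ** k


if __name__ == "__main__":
    for k in range(1, 4):
        print(f"chains of length {k}: {num_chains(k):,}")
```

Under this model, chains of length 3 already number about 4.1 × 10^10. Many such chains would be redundant or infeasible in practice, so the count is only an upper bound on raw combinations, not a count of valid MCU tasks.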
Taxonomy
Topics: Artificial Intelligence in Games · Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning
