MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao

TL;DR
MineDreamer is an innovative embodied agent in Minecraft that uses a Chain-of-Imagination mechanism with multimodal models to follow complex instructions more accurately and reliably than previous methods.
Contribution
It introduces a novel Chain-of-Imagination approach combined with multimodal models to enhance instruction-following in simulated-world control tasks.
Findings
Significantly outperforms baseline agents in instruction-following accuracy.
Nearly doubles the performance of existing generalist agents.
Demonstrates strong generalization and understanding of the open world.
Abstract
Designing a generalist embodied agent that can follow diverse instructions in human-like ways is a long-standing goal. However, existing approaches often fail to follow instructions steadily because they struggle to understand abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer is developed on top of recent advances in Multimodal Large Language Models (MLLMs) and diffusion models, and we employ a Chain-of-Imagination (CoI) mechanism to envision the step-by-step process of executing instructions and translate imaginations into more precise visual prompts tailored to the current state; subsequently, the agent generates…
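The loop described in the abstract (imagine the next step, turn it into a goal tailored to the current state, then act toward it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not MineDreamer's implementation: every name here (`chain_of_imagination`, `imagine`, `act`, `step`, and the toy functions) is hypothetical, the integer observation stands in for game frames, and re-imagining on every step is a simplification chosen for brevity.

```python
from typing import Callable, List

# Hypothetical stand-ins: a real agent would use image frames and
# keyboard-and-mouse actions rather than ints and strings.
Observation = int
Goal = int
Action = str


def chain_of_imagination(
    instruction: str,
    observation: Observation,
    imagine: Callable[[str, Observation], Goal],
    act: Callable[[Observation, Goal], Action],
    step: Callable[[Observation, Action], Observation],
    max_steps: int = 3,
) -> List[Action]:
    """Alternate between imagining a sub-goal and acting toward it."""
    actions: List[Action] = []
    for _ in range(max_steps):
        # 1. Imagine what the next step should look like from the current state.
        goal = imagine(instruction, observation)
        # 2. Choose a low-level action conditioned on the state and imagined goal.
        action = act(observation, goal)
        actions.append(action)
        # 3. Apply the action and observe the new state.
        observation = step(observation, action)
    return actions


# Toy environment (assumption): the observation counts collected logs.
def toy_imagine(instruction: str, obs: Observation) -> Goal:
    return obs + 1  # imagine holding one more log than now


def toy_act(obs: Observation, goal: Goal) -> Action:
    return "chop" if goal > obs else "wait"


def toy_step(obs: Observation, action: Action) -> Observation:
    return obs + 1 if action == "chop" else obs


if __name__ == "__main__":
    print(chain_of_imagination("chop a tree", 0, toy_imagine, toy_act, toy_step))
```

Passing the imaginator, the controller, and the environment as plain callables keeps the sketch independent of any particular model: in the setting the abstract describes, the imaginator role would be filled by the MLLM and diffusion components, and the controller by the low-level action policy.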
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Human Motion and Animation · Model Reduction and Neural Networks
Methods: Diffusion
