JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang

TL;DR
JARVIS-1 is a multimodal, memory-augmented agent capable of understanding and executing over 200 diverse tasks in Minecraft, demonstrating significant improvements in long-horizon task completion over existing agents.
Contribution
The paper introduces JARVIS-1, a novel open-world agent that integrates multimodal perception, planning, embodied control, and memory to handle an extensive range of tasks in Minecraft.
Findings
Achieves nearly perfect performance on short-horizon tasks.
Is five times more reliable than prior state-of-the-art agents on the long-horizon ObtainDiamondPickaxe task.
Successfully completes over 200 diverse tasks in Minecraft.
Abstract
Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit…
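The pipeline the abstract describes (a planner maps an observation and an instruction to a plan, and each plan step is dispatched to a goal-conditioned controller) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper's code: `Memory`, `run_task`, `stub_planner`, and `stub_controller` are stand-ins, and the way memory is consulted and updated is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type aliases: a planner maps (task, observation, recalled plan)
# to a list of sub-goals; a controller tries one sub-goal and reports success.
Planner = Callable[[str, str, List[str]], List[str]]
Controller = Callable[[str], bool]


@dataclass
class Memory:
    """Assumed store of plans that succeeded before, keyed by task name."""
    plans: Dict[str, List[str]] = field(default_factory=dict)

    def recall(self, task: str) -> List[str]:
        # Return a previously successful plan, or an empty list if none exists.
        return self.plans.get(task, [])

    def store(self, task: str, plan: List[str]) -> None:
        self.plans[task] = list(plan)


def run_task(task: str, observation: str, planner: Planner,
             controller: Controller, memory: Memory) -> bool:
    """Plan, dispatch each sub-goal to the controller, and remember successes."""
    recalled = memory.recall(task)
    plan = planner(task, observation, recalled)
    for goal in plan:
        if not controller(goal):
            return False  # stop at the first failed sub-goal
    memory.store(task, plan)
    return True


def stub_planner(task: str, observation: str, recalled: List[str]) -> List[str]:
    # Stand-in for a multimodal language model: reuse a recalled plan if any.
    if recalled:
        return recalled
    return ["collect wood", "craft planks", f"craft {task}"]


def stub_controller(goal: str) -> bool:
    # Stand-in for a goal-conditioned controller that always succeeds.
    return True
```

A call such as `run_task("wooden pickaxe", "forest", stub_planner, stub_controller, Memory())` runs the three stubbed sub-goals in order and stores the plan so a later call for the same task can reuse it.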
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
