JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

Zihao Wang; Shaofei Cai; Anji Liu; Yonggang Jin; Jinbing Hou; Bowei Zhang; Haowei Lin; Zhaofeng He; Zilong Zheng; Yaodong Yang; Xiaojian Ma; Yitao Liang

arXiv:2311.05997 · cs.AI · December 1, 2023 · 6 cites

TL;DR

JARVIS-1 is a multimodal, memory-augmented agent capable of understanding and executing over 200 diverse tasks in Minecraft, demonstrating significant improvements in long-horizon task completion over existing agents.

Contribution

The paper introduces JARVIS-1, a novel open-world agent that integrates multimodal perception, planning, embodied control, and memory to handle an extensive range of tasks in Minecraft.

Findings

01. Achieves nearly perfect performance on short-horizon tasks.

02. Outperforms prior state-of-the-art agents on long-horizon tasks by a factor of five.

03. Completes over 200 diverse tasks in Minecraft.

Abstract

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit…
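The pipeline the abstract describes (observations and instructions mapped to a plan, with the plan dispatched to goal-conditioned controllers) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the `Agent` class, the `planner` and `controller` callables, the string observations, and the dictionary used as memory are all hypothetical stand-ins. In particular, the way memory is consulted here (reusing the last successful plan for the same instruction) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Agent:
    # Hypothetical planner: (observation, instruction, recalled plans) -> sub-goals.
    # In the paper this role is played by a multimodal language model.
    planner: Callable[[str, str, List[List[str]]], List[str]]
    # Hypothetical goal-conditioned controller:
    # (sub-goal, observation) -> (new observation, success flag).
    controller: Callable[[str, str], Tuple[str, bool]]
    # Assumed memory layout: instruction -> list of plans that succeeded before.
    memory: Dict[str, List[List[str]]] = field(default_factory=dict)

    def run(self, observation: str, instruction: str) -> bool:
        recalled = self.memory.get(instruction, [])
        plan = self.planner(observation, instruction, recalled)
        for goal in plan:
            # Dispatch each sub-goal to the controller in order.
            observation, ok = self.controller(goal, observation)
            if not ok:
                return False
        # Store the plan only after every sub-goal succeeded.
        self.memory.setdefault(instruction, []).append(plan)
        return True


def toy_planner(observation: str, instruction: str,
                recalled: List[List[str]]) -> List[str]:
    # Reuse the most recent successful plan if one is remembered.
    if recalled:
        return recalled[-1]
    return ["collect wood", "craft planks", "craft " + instruction]


def toy_controller(goal: str, observation: str) -> Tuple[str, bool]:
    # Always succeeds; a real controller would act in the environment.
    return observation + " | done: " + goal, True


agent = Agent(planner=toy_planner, controller=toy_controller)
succeeded = agent.run("forest", "wooden pickaxe")
```

Keeping the planner and controller as plain callables makes the division of labor explicit: the planner decides what to do, the controller decides how, and the memory only records plans that ran to completion.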

Code & Models

Datasets

TESS-Computer/minecraft-vla-stage3
dataset · 26 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques