VIMA: General Robot Manipulation with Multimodal Prompts

Yunfan Jiang; Agrim Gupta; Zichen Zhang; Guanzhi Wang; Yongqiang Dou; Yanjun Chen; Li Fei-Fei; Anima Anandkumar; Yuke Zhu; Linxi Fan

arXiv:2210.03094·cs.RO·May 30, 2023·65 cites

TL;DR

VIMA introduces a transformer-based model that uses multimodal prompts to unify various robot manipulation tasks, achieving strong zero-shot generalization and data efficiency in a new simulation benchmark.

Contribution

The paper presents a novel multimodal prompt framework and a scalable transformer-based robot agent, VIMA, for unified manipulation learning with systematic evaluation.

Findings

01 VIMA outperforms alternative designs in the hardest zero-shot generalization setting by up to 2.9x in task success rate, given the same training data.

02 With 10x less training data, VIMA still performs 2.7x better than the best competing variant.

03 A new simulation benchmark with thousands of procedurally generated tabletop tasks and 600K+ expert trajectories supports systematic evaluation of generalization.

Abstract

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that…
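The notion of a multimodal prompt that interleaves textual and visual tokens can be illustrated with a small sketch. This is not VIMA's implementation: the `TextToken` and `ImageToken` classes, the `build_prompt` helper, the `{name}` placeholder syntax, and the image file names are all hypothetical, chosen only to show how one ordered sequence can carry both modalities.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class TextToken:
    """A single word from the textual part of the prompt (hypothetical type)."""
    word: str


@dataclass(frozen=True)
class ImageToken:
    """A slot that refers to an image by name (hypothetical type)."""
    name: str


PromptToken = Union[TextToken, ImageToken]

# Assumed placeholder syntax: a whitespace-separated piece such as "{object}".
PLACEHOLDER = re.compile(r"^\{(\w+)\}$")


def build_prompt(template: str, images: Dict[str, str]) -> List[PromptToken]:
    """Split a template into one ordered sequence of text and image tokens."""
    tokens: List[PromptToken] = []
    for piece in template.split():
        match = PLACEHOLDER.match(piece)
        if match:
            name = match.group(1)
            if name not in images:
                raise KeyError(f"no image supplied for placeholder {name!r}")
            tokens.append(ImageToken(name))
        else:
            tokens.append(TextToken(piece))
    return tokens


# Illustrative prompt and file names; none of these come from the benchmark.
prompt = build_prompt(
    "Put the {object} into the {container}",
    images={"object": "object.png", "container": "container.png"},
)
kinds = ["img" if isinstance(t, ImageToken) else "txt" for t in prompt]
print(kinds)
```

In a real agent each token would be mapped to an embedding before reaching the transformer; the sketch stops at the interleaved sequence because the encoding details are not given in the text above.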

Code & Models

Models

🤗 VIMA/VIMA · model · ♡ 17

Datasets

VIMA/VIMA-Data · dataset · 62 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques