VIMA: General Robot Manipulation with Multimodal Prompts
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan

TL;DR
VIMA is a transformer-based robot agent that uses multimodal prompts to unify a wide range of robot manipulation tasks, achieving strong zero-shot generalization and data efficiency on a new simulation benchmark.
Contribution
The paper presents a novel multimodal prompt framework and a scalable transformer-based robot agent, VIMA, for unified manipulation learning with systematic evaluation.
Findings
VIMA achieves up to 2.9x the task success rate of alternative designs in the hardest zero-shot generalization setting, given the same training data.
With 10x less training data, VIMA still performs 2.7x better than the best competing variant.
A new benchmark with thousands of procedurally generated tabletop tasks, 600K+ expert trajectories, and a four-level evaluation protocol supports systematic evaluation.
Abstract
Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that…
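To make the idea of a multimodal prompt concrete, the following is a minimal sketch of how a task instruction could be represented as an interleaved sequence of text and image segments. It is not VIMA's actual tokenizer or data format: the class names, the `{placeholder}` template syntax, and the example prompt are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class TextSegment:
    """A run of plain-language words in the prompt."""
    text: str


@dataclass
class ImageSegment:
    """A visual element in the prompt (hypothetical: an object crop or scene image)."""
    name: str
    image: Any


Segment = Union[TextSegment, ImageSegment]


def interleave_prompt(template: str, images: Dict[str, Any]) -> List[Segment]:
    """Split a template such as 'Put the {object} into the {container}'
    into an ordered list of text and image segments.

    The {placeholder} syntax is an assumption of this sketch, not the
    benchmark's real prompt format.
    """
    segments: List[Segment] = []
    # The capturing group keeps the placeholders in the split result.
    for part in re.split(r"(\{\w+\})", template):
        if not part.strip():
            continue  # skip empty or whitespace-only pieces
        if part.startswith("{") and part.endswith("}"):
            key = part[1:-1]
            if key not in images:
                raise KeyError(f"no image supplied for placeholder '{key}'")
            segments.append(ImageSegment(name=key, image=images[key]))
        else:
            segments.append(TextSegment(text=part.strip()))
    return segments


def modalities(segments: List[Segment]) -> List[str]:
    """Return the modality of each segment, in prompt order."""
    return ["image" if isinstance(s, ImageSegment) else "text" for s in segments]


# Illustrative usage with stand-in image data (plain bytes instead of real pixels).
prompt = interleave_prompt(
    "Put the {object} into the {container}",
    {"object": b"object-crop", "container": b"container-crop"},
)
print(modalities(prompt))
```

In this representation a language instruction, a visual goal, and a one-shot demonstration differ only in which segments appear and in what order, which is the property that lets one sequence model consume all of them.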
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
