Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning
Jiachen Li, Qiaozi Gao, Michael Johnston, Xiaofeng Gao, Xuehai He, Suhaila Shakiah, Hangjie Shi, Reza Ghanadan, William Yang Wang

TL;DR
This paper introduces a multimodal prompt-based framework for robot manipulation, combining vision and language signals through pretraining and multi-task fine-tuning, achieving state-of-the-art results on VIMA-BENCH.
Contribution
It presents a novel two-stage training pipeline and a multimodal prompt encoder that enhances robot understanding of combined vision and language cues.
Findings
10% success rate improvement on VIMA-BENCH
Effective multimodal understanding and in-context learning demonstrated
State-of-the-art performance achieved
Abstract
Prompt-based learning has been demonstrated as a compelling paradigm contributing to the tremendous success of large language models (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts, interleaving vision signals with text descriptions. This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals. We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our method consists of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we…
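The inverse dynamics pretraining stage named in the abstract can be illustrated with a toy example. In an inverse dynamics objective, a model sees two consecutive observations and predicts the action that connects them. The sketch below is not the paper's architecture: it assumes hypothetical linear dynamics, synthetic low-dimensional states in place of images, and a least-squares fit in place of a neural policy. All function names are illustrative.

```python
import numpy as np

def make_trajectories(n_steps=500, state_dim=4, action_dim=2, seed=0):
    """Synthetic transitions under assumed linear dynamics: s_next = s + a @ B."""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(action_dim, state_dim))  # hypothetical action-to-state map
    states = rng.normal(size=(n_steps, state_dim))
    actions = rng.normal(size=(n_steps, action_dim))
    next_states = states + actions @ B
    return states, actions, next_states

def fit_inverse_dynamics(states, next_states, actions):
    """Fit a linear map from the pair (s, s_next) to the action a by least squares."""
    X = np.hstack([states, next_states])
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return W

def predict_actions(W, states, next_states):
    """Predict the action that moved each state to its successor."""
    return np.hstack([states, next_states]) @ W

states, actions, next_states = make_trajectories()
W = fit_inverse_dynamics(states, next_states, actions)
mse = float(np.mean((predict_actions(W, states, next_states) - actions) ** 2))
print(f"inverse dynamics MSE: {mse:.2e}")
```

In a two-stage pipeline of the kind the abstract describes, weights learned under an objective like this would then serve as the starting point for multi-task finetuning; the toy above covers only the pretraining objective.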
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Residual Connection
