Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning
Jianxiong Li, Zhihao Wang, Jinliang Zheng, Xiaoai Zhou, Guanming Wang, Guanglu Song, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Junzhi Yu, Xianyuan Zhan

TL;DR
Robo-MUTUAL introduces a novel approach for robotic multimodal task specification that leverages unimodal data and out-of-domain pretraining to achieve effective cross-modality alignment, enabling robots to understand complex instructions with limited paired data.
Contribution
The paper presents a method to teach robots multimodal task understanding using unimodal data and novel alignment techniques, reducing the need for extensive paired multimodal datasets.
Findings
Outperforms existing methods on the LIBERO benchmark.
Successfully generalizes to over 130 tasks and real robot platforms.
Achieves significant improvements in data efficiency for robotic learning.
Abstract
Multimodal task specification is essential for enhanced robotic performance, where Cross-modality Alignment enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training proves impractical, due to the sparsity of paired multimodal data. In this study, we demonstrate that by leveraging unimodal instructions abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong Cross-modality Alignment capabilities by pretraining a robotic multimodal encoder using extensive out-of-domain data. Then, we employ two operations, Collapse and Corrupt, to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of an identical task goal as interchangeable representations, thus…
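The abstract names the Collapse and Corrupt operations without defining them. The sketch below illustrates one plausible reading, and it is an assumption rather than the authors' implementation: "collapse" is taken to mean subtracting each modality's mean embedding to remove a constant offset between modalities, and "corrupt" is taken to mean adding Gaussian noise so a downstream policy tolerates any residual misalignment. All function names, the toy embeddings, and the noise scale are hypothetical.

```python
import random
from statistics import fmean


def modality_mean(embeddings):
    """Per-dimension mean of a list of equal-length embedding vectors."""
    return [fmean(dim) for dim in zip(*embeddings)]


def collapse(embedding, mean):
    """Assumed 'collapse': subtract the modality's mean embedding."""
    return [x - m for x, m in zip(embedding, mean)]


def corrupt(embedding, sigma, rng):
    """Assumed 'corrupt': add isotropic Gaussian noise with std sigma."""
    return [x + rng.gauss(0.0, sigma) for x in embedding]


# Toy data: text and image embeddings of the same two goals,
# separated by a constant offset that stands in for the modality gap.
text = [[0.0, 1.0], [2.0, 3.0]]
image = [[10.0, 11.0], [12.0, 13.0]]

text_c = [collapse(e, modality_mean(text)) for e in text]
image_c = [collapse(e, modality_mean(image)) for e in image]

# After mean subtraction the two modalities coincide in this toy case.
# Noise is then applied to the embedding that conditions the policy.
rng = random.Random(0)
noisy_goal = corrupt(text_c[0], sigma=0.1, rng=rng)
```

In this toy setup the gap is a pure translation, so mean subtraction alone makes the two modalities identical. With real encoders the gap would not be exactly constant, which is the motivation for the noise step under this reading.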
Taxonomy
Topics: Speech and dialogue systems · Robot Manipulation and Learning · Robotics and Automated Systems
