MUTEX: Learning Unified Policies from Multimodal Task Specifications
Rutav Shah, Roberto Martín-Martín, Yuke Zhu

TL;DR
MUTEX is a transformer-based framework enabling robots to understand and execute tasks from multiple modalities of instructions, improving cross-modal task comprehension and execution in simulation and real-world settings.
Contribution
It introduces a unified multimodal policy learning approach using a two-stage training process with cross-modal reasoning capabilities.
Findings
Outperforms methods trained on a single modality in task execution accuracy.
Effective in both simulation and real-world environments.
Supports six different instruction modalities for flexible task following.
Abstract
Humans use different modalities, such as speech, text, images, videos, etc., to communicate their intent and goals with teammates. For robots to become better assistants, we aim to endow them with the ability to follow instructions and understand tasks specified by their human partners. Most robotic policy learning methods have focused on one single modality of task specification while ignoring the rich cross-modal information. We present MUTEX, a unified approach to policy learning from multimodal task specifications. It trains a transformer-based architecture to facilitate cross-modal reasoning, combining masked modeling and cross-modal matching objectives in a two-stage training procedure. After training, MUTEX can follow a task specification in any of the six learned modalities (video demonstrations, goal images, text goal descriptions, text instructions, speech goal descriptions,…
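The abstract names two training objectives, masked modeling and cross-modal matching, combined in a two-stage procedure. The sketch below is only an illustration of how such objectives could be written, not the authors' implementation. Everything in it is an assumption made for the example: the function names, the use of scalar "tokens", the squared-error form of both losses, the choice of a reference modality, and the split of one objective per stage.

```python
import random


def mask_tokens(tokens, mask_ratio, rng, mask_value=None):
    """Hide a random subset of tokens; return the masked copy and the hidden indices."""
    n_mask = int(round(mask_ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in masked_idx:
        masked[i] = mask_value
    return masked, masked_idx


def masked_modeling_loss(predicted, original, masked_idx):
    """Mean squared error, computed only at the positions that were hidden."""
    if not masked_idx:
        return 0.0
    return sum((predicted[i] - original[i]) ** 2 for i in masked_idx) / len(masked_idx)


def cross_modal_matching_loss(embeddings, reference):
    """Mean squared distance from each modality's embedding to the reference one.

    `embeddings` maps a modality name to a list of floats. Pulling every
    modality toward one reference modality is an assumption of this sketch.
    """
    ref = embeddings[reference]
    others = [name for name in embeddings if name != reference]
    total = 0.0
    for name in others:
        total += sum((a - b) ** 2 for a, b in zip(embeddings[name], ref)) / len(ref)
    return total / len(others)


def auxiliary_loss(stage, masked_loss, matching_loss):
    """Pick the auxiliary objective for a stage (assumed: one objective per stage)."""
    if stage == 1:
        return masked_loss
    if stage == 2:
        return matching_loss
    raise ValueError("stage must be 1 or 2")
```

For example, calling `cross_modal_matching_loss({"video": [1.0, 1.0], "text": [0.0, 0.0]}, "video")` measures how far the text embedding sits from the video embedding; treating video as the reference is a choice made here for illustration only.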
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
