MUTEX: Learning Unified Policies from Multimodal Task Specifications

Rutav Shah; Roberto Martín-Martín; Yuke Zhu

arXiv:2309.14320 · cs.RO · September 26, 2023 · 5 citations

TL;DR

MUTEX is a transformer-based framework that lets robots understand and execute tasks specified in any of several instruction modalities, improving cross-modal task comprehension and execution in both simulation and real-world settings.

Contribution

MUTEX introduces a unified approach to multimodal policy learning that uses a two-stage training process to support cross-modal reasoning.

Findings

1. Outperforms methods trained on a single modality in task execution accuracy.

2. Effective in both simulation and real-world environments.

3. Supports six different instruction modalities for flexible task following.

Abstract

Humans use different modalities, such as speech, text, images, videos, etc., to communicate their intent and goals with teammates. For robots to become better assistants, we aim to endow them with the ability to follow instructions and understand tasks specified by their human partners. Most robotic policy learning methods have focused on one single modality of task specification while ignoring the rich cross-modal information. We present MUTEX, a unified approach to policy learning from multimodal task specifications. It trains a transformer-based architecture to facilitate cross-modal reasoning, combining masked modeling and cross-modal matching objectives in a two-stage training procedure. After training, MUTEX can follow a task specification in any of the six learned modalities (video demonstrations, goal images, text goal descriptions, text instructions, speech goal descriptions,…
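The two objectives named in the abstract, masked modeling and cross-modal matching, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names, the use of mean squared error as the matching loss, and the choice of a caller-selected anchor modality are all hypothetical. Only the five modalities spelled out in the abstract are listed, since the sixth is cut off there.

```python
import random

# The five modalities named in the abstract; the sixth is truncated in the
# text above and is deliberately left out of this sketch.
MODALITIES = [
    "video_demonstration",
    "goal_image",
    "text_goal",
    "text_instructions",
    "speech_goal",
]


def mask_tokens(tokens, mask_ratio, rng, mask_value=None):
    """Masked-modeling input: hide a random subset of tokens.

    Returns a masked copy of `tokens` and the sorted indices that were hidden,
    which a model would then be trained to reconstruct.
    """
    n_mask = int(round(mask_ratio * len(tokens)))
    hidden = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in hidden:
        masked[i] = mask_value
    return masked, hidden


def l2_matching_loss(source, target):
    """Mean squared error between two equal-length embedding vectors.

    Assumption: the abstract does not say which distance is used.
    """
    if len(source) != len(target):
        raise ValueError("embeddings must have the same length")
    return sum((s - t) ** 2 for s, t in zip(source, target)) / len(source)


def cross_modal_matching_loss(embeddings, anchor):
    """Average distance from every other modality's embedding to the anchor's.

    `embeddings` maps a modality name to its embedding vector; `anchor` is the
    modality whose representation the others are pulled toward (hypothetical).
    """
    others = [name for name in embeddings if name != anchor]
    total = sum(l2_matching_loss(embeddings[name], embeddings[anchor]) for name in others)
    return total / len(others)
```

In a two-stage setup of this kind, a function like `mask_tokens` would feed the first stage and a loss like `cross_modal_matching_loss` would drive the second, so that a specification given in one modality lands near the representation of the same task in another.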

Code & Models

Datasets

lerobot/utaustin_mutex
dataset · 1.2k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques