Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Moritz Reuss; Ömer Erdinç Yağmurlu; Fabian Wenzel; Rudolf Lioutikov

arXiv:2407.05996·cs.RO·July 9, 2024·1 cites

TL;DR

The paper presents MDT, a diffusion transformer framework that learns versatile, long-horizon manipulation behaviors from multimodal goals with minimal language annotations, outperforming existing methods on challenging benchmarks.

Contribution

Introduces a novel diffusion-based transformer with self-supervised objectives for multimodal goal-conditioned manipulation, enabling learning from sparsely annotated datasets.

Findings

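01 Achieves state-of-the-art performance on the CALVIN and LIBERO benchmarks.

02 Remains effective on a LIBERO variant in which less than 2% of the data carries language annotations.

03 Reports an absolute improvement of 15% over prior state-of-the-art methods on the CALVIN manipulation challenge.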

Abstract

This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework that excels at learning versatile behavior from multimodal goal specifications with few language annotations. MDT leverages a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks based on multimodal goals. The vast majority of imitation learning methods learn from only a single goal modality, e.g., either language or goal images. However, existing large-scale imitation learning datasets are only partially labeled with language annotations, which prevents current methods from learning language-conditioned behavior from these datasets. MDT addresses this challenge by introducing a latent goal-conditioned state representation that is simultaneously trained on multimodal goal instructions. This state…
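The abstract's point about partially labeled datasets can be illustrated with a small sketch. It is not taken from the MDT code base: the `Trajectory` class, the `sample_goal` function, and the `p_language` parameter are hypothetical names chosen for illustration, and the use of the trajectory's final frame as an image goal is an assumption. The sketch shows one simple way a training loop could draw a goal for every sample even when only a few trajectories carry a language instruction.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Trajectory:
    """Toy stand-in for one demonstration (hypothetical structure)."""
    frames: List[str]                  # placeholders for camera images
    instruction: Optional[str] = None  # most trajectories have no language label


def sample_goal(traj: Trajectory, rng: random.Random,
                p_language: float = 0.5) -> Tuple[str, str]:
    """Pick one goal modality for a single training sample.

    Assumption: every trajectory can supply an image goal (its final frame),
    while a language goal is only available if an instruction was annotated.
    """
    if traj.instruction is not None and rng.random() < p_language:
        return ("language", traj.instruction)
    return ("image", traj.frames[-1])


def count_modalities(dataset: List[Trajectory], rng: random.Random,
                     n_samples: int = 1000) -> Dict[str, int]:
    """Draw goals repeatedly and count how often each modality is used."""
    counts = {"language": 0, "image": 0}
    for _ in range(n_samples):
        traj = rng.choice(dataset)
        modality, _goal = sample_goal(traj, rng)
        counts[modality] += 1
    return counts


# Toy dataset: 50 trajectories, only one of which has a language annotation.
rng = random.Random(0)
dataset = [Trajectory(frames=[f"frame_{i}_0", f"frame_{i}_1"]) for i in range(49)]
dataset.append(Trajectory(frames=["frame_49_0", "frame_49_1"],
                          instruction="open the drawer"))
counts = count_modalities(dataset, rng)
print(counts)
```

In this toy setup nearly all sampled goals are images, which is why a policy that shares one goal-conditioned representation across both modalities can still make use of the unlabeled trajectories.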

Code & Models

Repositories

intuitive-robots/mdt_policy (official)

Taxonomy

Topics: Speech and dialogue systems