Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals
Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, Rudolf Lioutikov

TL;DR
The paper presents MDT, a diffusion transformer framework that learns versatile, long-horizon manipulation behaviors from multimodal goals with minimal language annotations, outperforming existing methods on challenging benchmarks.
Contribution
Introduces a novel diffusion-based transformer with self-supervised objectives for multimodal goal-conditioned manipulation, enabling learning from sparsely annotated datasets.
Findings
Achieves state-of-the-art performance on the CALVIN and LIBERO benchmarks.
Remains effective on LIBERO when less than 2% of the data carries language annotations.
Improves manipulation success by 15% over prior methods.
Abstract
This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework that excels at learning versatile behavior from multimodal goal specifications with few language annotations. MDT leverages a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks based on multimodal goals. The vast majority of imitation learning methods learn from only a single goal modality, e.g. either language or goal images. However, existing large-scale imitation learning datasets are only partially labeled with language annotations, which prevents current methods from learning language-conditioned behavior from these datasets. MDT addresses this challenge by introducing a latent goal-conditioned state representation that is simultaneously trained on multimodal goal instructions. This state…
Taxonomy
Topics: Speech and dialogue systems
