What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data
Oier Mees, Lukas Hermann, Wolfram Burgard

TL;DR
This paper studies the main challenges in learning language-conditioned robot policies from offline, free-form imitation data. It proposes architectural and algorithmic improvements, combined in a new model that significantly outperforms existing methods on the CALVIN manipulation benchmark.
Contribution
The paper analyzes the design choices behind language-conditioned imitation learning and presents a model that combines hierarchical control, a multimodal transformer encoder, discrete latent plans, and a self-supervised contrastive loss that aligns video and language.
Findings
Significant performance improvements on CALVIN benchmark
Effective use of hierarchical decomposition and multimodal transformers
Open-sourced code and models for future research
Abstract
A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks from perceptions obtained with their onboard sensors and specified only via natural language. While recently substantial advances have been achieved in language-driven robotics by leveraging end-to-end learning from pixels, there is no clear and well-understood process for making various design choices due to the underlying variation in setups. In this paper, we conduct an extensive study of the most critical challenges in learning language conditioned policies from offline free-form imitation datasets. We further identify architectural and algorithmic techniques that improve performance, such as a hierarchical decomposition of the robot control learning, a multimodal transformer encoder, discrete latent plans and a self-supervised contrastive loss that aligns video and language…
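To illustrate the idea of a contrastive loss that aligns video and language, here is a minimal sketch. It assumes a symmetric InfoNCE-style objective over a batch of paired embeddings, which is one common way to build such a loss; the abstract does not specify the exact formulation used in the paper. The function names, the toy embeddings, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct "class" for row i is column i,
    # i.e. the i-th video should match the i-th instruction.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_emb, lang_emb, temperature=0.07):
    # Assumed form: average of video-to-language and language-to-video losses.
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in lang_emb]
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        _cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed)
    )


# Toy embeddings (hypothetical): three orthogonal "video" vectors.
video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Language embeddings identical to the videos: pairs are perfectly aligned.
matched = symmetric_contrastive_loss(video, video)

# Language embeddings rotated by one position: every pair is mismatched.
mismatched = symmetric_contrastive_loss(video, video[1:] + video[:1])
```

In this toy setup the loss is close to zero when each video embedding matches its paired instruction and large when the pairing is shuffled, which is the behavior a contrastive alignment objective is meant to reward.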
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
