Instruction-Following Agents with Multimodal Transformer
Hao Liu, Lisa Lee, Kimin Lee, Pieter Abbeel

TL;DR
This paper introduces a multimodal transformer model that encodes visual observations and language instructions for robots, enabling improved instruction-following in vision-based environments with better scalability and generalization.
Contribution
The paper presents a unified transformer-based model pre-trained on large image-text datasets, effectively integrating vision and language for instruction-following tasks in robotics.
Findings
Outperforms state-of-the-art methods in single-task settings
Demonstrates superior multi-task generalization
Shows improved scalability over prior models
Abstract
Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained multimodal models typically come with separate language and visual representations, requiring specialized network architectures to fuse them. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our method consists of a multimodal transformer that encodes visual observations and language instructions, and a transformer-based policy that predicts actions based on the encoded representations. The multimodal transformer is…
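The two-stage design described in the abstract (a multimodal encoder followed by a transformer-based policy) can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the single-head attention, random weights, mean pooling, vector sizes, and the names `encode` and `act` are all assumptions chosen to keep the example small, and positional embeddings and training are omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols):
    """Random matrix standing in for learned weights or embedded inputs."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over token vectors."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def mean_pool(tokens):
    """Average token vectors into one vector (an assumed pooling choice)."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

DIM, NUM_ACTIONS = 8, 4          # hypothetical sizes
rng = random.Random(0)
enc_w = [rand_matrix(rng, DIM, DIM) for _ in range(3)]   # encoder Q, K, V
pol_w = [rand_matrix(rng, DIM, DIM) for _ in range(3)]   # policy Q, K, V
action_head = rand_matrix(rng, DIM, NUM_ACTIONS)

def encode(image_patches, instruction_tokens):
    """Stage 1: attend jointly over image patches and instruction tokens."""
    joint = image_patches + instruction_tokens   # one shared sequence
    return mean_pool(self_attention(joint, *enc_w))

def act(encoded_history):
    """Stage 2: attend over encoded timesteps, score actions from the latest."""
    attended = self_attention(encoded_history, *pol_w)
    logits = matmul([attended[-1]], action_head)[0]
    return softmax(logits)

# Three timesteps: each has 6 image-patch vectors and shares one 5-token instruction.
instruction = rand_matrix(rng, 5, DIM)
history = [encode(rand_matrix(rng, 6, DIM), instruction) for _ in range(3)]
probs = act(history)
print(len(probs), round(sum(probs), 6))
```

The point of the sketch is the split of responsibilities: vision and language are fused inside one attention pass in `encode`, so the policy in `act` only ever sees a single encoded vector per timestep.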
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
