Instruction-Following Agents with Multimodal Transformer

Hao Liu; Lisa Lee; Kimin Lee; Pieter Abbeel

arXiv:2210.13431 · cs.CV · March 28, 2023 · 6 cites

Open Access · 1 Repo

TL;DR

This paper introduces a multimodal transformer model that encodes visual observations and language instructions for robots, enabling improved instruction-following in vision-based environments with better scalability and generalization.

Contribution

The paper presents a unified transformer-based model pre-trained on large image-text datasets, effectively integrating vision and language for instruction-following tasks in robotics.

Findings

01 Outperforms state-of-the-art methods in single-task settings

02 Demonstrates superior multi-task generalization

03 Shows improved scalability over prior models

Abstract

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained multimodal models typically come with divided language and visual representations, requiring specialized network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a transformer-based policy that predicts actions based on encoded representations. The multimodal transformer is…
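The two-stage design described in the abstract (a multimodal encoder followed by a transformer-based policy) can be illustrated with a minimal JAX sketch. This is not the authors' implementation: every name (`init_params`, `patchify`, `encode_step`, `policy`), every size constant, the single-head single-layer attention, and the mean pooling are assumptions chosen only to keep the example short. The real encoder is pre-trained, whereas the weights here are random.

```python
import jax
import jax.numpy as jnp

# All sizes are illustrative assumptions, not values from the paper.
PATCH, D_MODEL, VOCAB, N_ACTIONS = 8, 32, 100, 6


def init_params(key):
    """Randomly initialised weights (the paper's encoder is pre-trained instead)."""
    k = jax.random.split(key, 9)
    s = 0.02
    qkv = lambda ks: {n: s * jax.random.normal(kk, (D_MODEL, D_MODEL))
                      for n, kk in zip(("q", "k", "v"), ks)}
    return {
        "patch_proj": s * jax.random.normal(k[0], (PATCH * PATCH * 3, D_MODEL)),
        "tok_embed": s * jax.random.normal(k[1], (VOCAB, D_MODEL)),
        "enc": qkv(k[2:5]),   # hypothetical multimodal encoder layer
        "pol": qkv(k[5:8]),   # hypothetical policy layer
        "action_head": s * jax.random.normal(k[8], (D_MODEL, N_ACTIONS)),
    }


def attention(p, x, causal=False):
    """Single-head self-attention with a residual connection. x: (T, D)."""
    q, k, v = x @ p["q"], x @ p["k"], x @ p["v"]
    scores = q @ k.T / jnp.sqrt(x.shape[-1])
    if causal:
        t = x.shape[0]
        mask = jnp.tril(jnp.ones((t, t))) > 0  # step i sees steps <= i only
        scores = jnp.where(mask, scores, -1e9)
    return x + jax.nn.softmax(scores, axis=-1) @ v


def patchify(image):
    """Split an (H, W, 3) image into flattened PATCH x PATCH patches."""
    h, w, c = image.shape
    x = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)


def encode_step(params, image, instruction_ids):
    """Jointly encode one observation and the instruction into one vector."""
    img_tokens = patchify(image) @ params["patch_proj"]
    txt_tokens = params["tok_embed"][instruction_ids]
    # Text and image tokens share one sequence, so attention mixes modalities.
    tokens = jnp.concatenate([txt_tokens, img_tokens], axis=0)
    return attention(params["enc"], tokens).mean(axis=0)


def policy(params, images, instruction_ids):
    """Map a (T, H, W, 3) observation history to (T, N_ACTIONS) logits."""
    feats = jax.vmap(lambda im: encode_step(params, im, instruction_ids))(images)
    hidden = attention(params["pol"], feats, causal=True)
    return hidden @ params["action_head"]


params = init_params(jax.random.PRNGKey(0))
images = jnp.zeros((4, 32, 32, 3))        # four dummy camera frames
instruction_ids = jnp.array([5, 17, 42])  # dummy instruction token ids
logits = policy(params, images, instruction_ids)
print(logits.shape)  # (4, 6)
```

Concatenating text and image tokens into one sequence is what lets a single transformer produce a joint representation, in contrast to the separate language and visual encoders the abstract criticizes. The causal mask in the policy layer keeps each action prediction dependent only on current and past observations.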

Code & Models

Repositories

lhao499/instructrl
JAX · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling