PaLM-E: An Embodied Multimodal Language Model

Danny Driess; Fei Xia; Mehdi S. M. Sajjadi; Corey Lynch; Aakanksha Chowdhery; Brian Ichter; Ayzaan Wahid; Jonathan Tompson; Quan Vuong; Tianhe Yu; Wenlong Huang; Yevgen Chebotar; Pierre Sermanet; Daniel Duckworth; Sergey Levine; Vincent Vanhoucke; Karol Hausman; Marc Toussaint; Klaus Greff; Andy Zeng; Igor Mordatch; Pete Florence

arXiv:2303.03378 · cs.LG · March 7, 2023 · 350 cites

TL;DR

PaLM-E is a large embodied multimodal language model that integrates visual, continuous sensor, and textual inputs to perform embodied tasks such as robotic manipulation planning, visual question answering, and captioning. It shows positive transfer from joint training and achieves state-of-the-art results on OK-VQA.

Contribution

This work introduces PaLM-E, a unified large-scale embodied multimodal language model that combines sensor data with language understanding for complex real-world tasks.

Findings

01. PaLM-E achieves state-of-the-art results on OK-VQA.

02. The model benefits from joint training across multiple modalities.

03. PaLM-E retains general language capabilities at scale.

Abstract

Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on…
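
As a rough illustration of the multi-modal sentences described in the abstract, the following Python sketch interleaves text token embeddings with continuous observations that are linearly projected into the same embedding space. Everything named here (EMBED_DIM, random_matrix, project, build_multimodal_sentence, the toy vocabulary, the random weights, and the choice of one linear projection per modality) is an assumption made for illustration; it is not the authors' encoder design or training code.

```python
import random

EMBED_DIM = 8  # hypothetical embedding width, chosen only for this example


def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (stand-in for learned parameters)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(feature, weights):
    """Linearly map one continuous feature vector into the embedding space."""
    return [
        sum(x * row[j] for x, row in zip(feature, weights))
        for j in range(len(weights[0]))
    ]


def build_multimodal_sentence(segments, vocab, token_table, projectors):
    """Interleave text and continuous segments into one sequence of embeddings."""
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            # Text: look up one embedding per whitespace-separated token.
            for word in payload.split():
                sequence.append(token_table[vocab[word]])
        else:
            # Continuous input: project each feature vector with its modality's weights.
            for feature in payload:
                sequence.append(project(feature, projectors[kind]))
    return sequence


rng = random.Random(0)
words = ["given", "image", "and", "state", "pick", "up", "the", "block"]
vocab = {word: index for index, word in enumerate(words)}
token_table = random_matrix(len(words), EMBED_DIM, rng)
projectors = {
    "image": random_matrix(4, EMBED_DIM, rng),  # 4-dim toy visual features
    "state": random_matrix(3, EMBED_DIM, rng),  # 3-dim toy state estimate
}

segments = [
    ("text", "given image"),
    ("image", [[0.2, 0.1, 0.0, 0.5], [0.3, 0.9, 0.4, 0.1]]),
    ("text", "and state"),
    ("state", [[0.0, 1.0, 0.5]]),
    ("text", "pick up the block"),
]
sequence = build_multimodal_sentence(segments, vocab, token_table, projectors)
print(len(sequence), len(sequence[0]))  # prints "11 8"
```

The resulting list of 11 embeddings stands in for what would be fed to the language model in place of a text-only token sequence. The abstract states that the encodings are trained end-to-end together with a pre-trained large language model; this sketch uses fixed random weights and does not attempt any training.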

Code & Models

Videos

PaLM-E: An Embodied Multimodal Language Model · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning