EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Yao Mu; Qinglong Zhang; Mengkang Hu; Wenhai Wang; Mingyu Ding; Jun Jin; Bin Wang; Jifeng Dai; Yu Qiao; Ping Luo

arXiv:2305.15021 · cs.RO · September 15, 2023 · 41 cites

TL;DR

EmbodiedGPT is a multi-modal foundation model that advances embodied AI by integrating large-scale embodied planning, high-quality language instructions, and a closed-loop system for improved task execution in physical environments.

Contribution

This work introduces EmbodiedGPT, a novel end-to-end embodied AI model that combines a large-scale dataset, efficient training, and a new planning-control paradigm for better embodied task performance.

Findings

01. Significantly improved success rates on benchmark tasks.

02. Effective integration of high-level planning with low-level control.

03. Enhanced multi-modal understanding for embodied agents.

Abstract

Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via…

Videos

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning