EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

TL;DR
EmbodiedGPT is a multi-modal foundation model that advances embodied AI by integrating a large-scale embodied planning dataset with high-quality language instructions and a closed-loop system for improved task execution in physical environments.
Contribution
This work introduces EmbodiedGPT, a novel end-to-end embodied AI model that combines a large-scale dataset, efficient training, and a new planning-control paradigm for better embodied task performance.
Findings
Significantly improved success rates on benchmark tasks.
Effective integration of high-level planning with low-level control.
Enhanced multi-modal understanding for embodied agents.
Abstract
Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
