Large Language Models for Robotics: Opportunities, Challenges, and Perspectives

Jiaqi Wang; Zihao Wu; Yiwei Li; Hanqi Jiang; Peng Shu; Enze Shi; Huawen Hu; Chong Ma; Yiheng Liu; Xuhui Wang; Yincheng Yao; Xuan Liu; Huaqin Zhao; Zhengliang Liu; Haixing Dai; Lin Zhao; Bao Ge; Xiang Li; Tianming Liu; and Shu Zhang

arXiv:2401.04334·cs.RO·January 10, 2024·20 cites

Open Access

TL;DR

This paper reviews the integration of large language models, especially multimodal ones such as GPT-4V, into robotics. It highlights opportunities and challenges and proposes a framework that uses visual perception to improve embodied task planning.

Contribution

It provides a comprehensive survey of LLMs in robotics and introduces a novel framework using multimodal GPT-4V to enhance embodied task planning.

Findings

01 GPT-4V improves robot performance in embodied tasks

02 Multimodal LLMs effectively integrate language and visual perception

03 Survey highlights key challenges and future directions in LLM-robot integration

Abstract

Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques