Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

Siyuan Huang; Zhengkai Jiang; Hao Dong; Yu Qiao; Peng Gao; Hongsheng Li

arXiv:2305.11176·cs.RO·May 25, 2023·30 cites

TL;DR

Instruct2Act leverages large language models to translate multi-modal instructions into robotic actions by generating Python programs, integrating foundation models for perception, and demonstrating superior zero-shot performance in tabletop manipulation tasks.

Contribution

This work introduces a flexible framework that maps multi-modal instructions to robotic actions using LLMs and foundation models, advancing zero-shot capabilities in manipulation tasks.

Findings

01. Outperforms state-of-the-art policies in several tasks

02. Effective in zero-shot scenarios

03. Flexible multi-modal instruction handling

Abstract

Foundation models have made significant strides in various applications, including text-to-image generation, panoptic segmentation, and natural language processing. This paper presents Instruct2Act, a framework that uses Large Language Models (LLMs) to map multi-modal instructions to sequential actions for robotic manipulation tasks. Specifically, Instruct2Act employs an LLM to generate Python programs that constitute a comprehensive perception, planning, and action loop for robotic tasks. In the perception section, pre-defined APIs are used to access multiple foundation models: the Segment Anything Model (SAM) accurately locates candidate objects, and CLIP classifies them. In this way, the framework leverages the expertise of foundation models and robotic abilities to convert complex high-level instructions into precise policy codes. Our approach is adjustable and flexible in…
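The perception, planning, and action loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or the repository's API: the function names (`cosine`, `classify`, `plan_pick_and_place`), the toy three-dimensional embeddings, and the object names are all hypothetical. The hand-written candidate list stands in for regions that a segmentation model such as SAM would propose, and the cosine-similarity match stands in for CLIP-style classification of each region against text queries.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(candidates, text_embeddings):
    """Perception: label each candidate region with the most similar text query."""
    labelled = []
    for cand in candidates:
        label = max(
            text_embeddings,
            key=lambda name: cosine(cand["embedding"], text_embeddings[name]),
        )
        labelled.append({"label": label, "center": cand["center"]})
    return labelled


def plan_pick_and_place(labelled, pick_label, place_label):
    """Planning: turn labelled objects into a single pick-and-place action."""
    centers = {obj["label"]: obj["center"] for obj in labelled}
    return {"pick": centers[pick_label], "place": centers[place_label]}


# Hypothetical toy data: in a real system the regions would come from a
# segmentation model and the embeddings from an image-text encoder.
candidates = [
    {"center": (0.2, 0.5), "embedding": [0.9, 0.1, 0.0]},
    {"center": (0.7, 0.3), "embedding": [0.1, 0.8, 0.2]},
]
text_embeddings = {
    "red block": [1.0, 0.0, 0.0],
    "green bowl": [0.0, 1.0, 0.0],
}

labelled = classify(candidates, text_embeddings)
action = plan_pick_and_place(labelled, "red block", "green bowl")
print(action)
```

In this sketch the generated "policy code" reduces to the last three lines: classify the candidates, then request a pick at one labelled object and a place at another. A program produced by an LLM would presumably chain calls of this kind according to the instruction it was given.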

Code & Models

Repositories

opengvlab/instruct2act (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training