Embodied Executable Policy Learning with Language-based Scene Summarization

Jielin Qiu; Mengdi Xu; William Han; Seungwhan Moon; Ding Zhao

arXiv:2306.05696·cs.RO·June 12, 2023·1 cites

TL;DR

This paper introduces a novel robot learning paradigm that uses language-based scene summarization from visual observations to generate executable actions, eliminating the need for human-labeled data and enabling adaptation through imitation and reinforcement learning.

Contribution

The paper proposes a framework that combines visual scene summarization with language-based action generation, so that robot learning does not depend on human scene annotation.

Findings

01. Outperforms existing baselines in VirtualHome environments.
02. Effective adaptation using imitation and reinforcement learning.
03. Versatile across various house layouts and tasks.

Abstract

Large language models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs heavily relies on domain-specific templated text data, which may be infeasible in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to evolve with non-expert interactions with environments. In this work, we introduce a novel learning paradigm that generates robots' executable actions in the form of text, derived solely from visual observations, using language-based summarization of these observations as the connecting bridge between both domains. Our proposed paradigm stands apart from previous works, which utilized either language instructions or a combination of language and visual data as inputs. Moreover, our method does not require oracle…
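
The two-stage flow described in the abstract (visual observation, then a language summary, then an executable action as text) can be illustrated with a minimal sketch. Everything below is hypothetical: the `Observation` type, the `summarize_scene` and `generate_action` stubs, and the bracketed action strings are assumptions made for illustration, not the authors' models or interfaces. In the paper both stages are learned; here they are replaced by toy rules so the data flow is visible.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """Hypothetical stand-in for an image: a list of detected object names."""
    visible_objects: List[str]


def summarize_scene(obs: Observation) -> str:
    """Stage 1 (toy stub): turn a visual observation into a text summary."""
    if not obs.visible_objects:
        return "The scene is empty."
    return "The scene contains: " + ", ".join(obs.visible_objects) + "."


def generate_action(summary: str) -> str:
    """Stage 2 (toy stub): map the text summary to an action string.

    The action format is illustrative only.
    """
    if "fridge" in summary:
        return "[open] <fridge>"
    return "[walk] <kitchen>"


def policy(
    obs: Observation,
    summarizer: Callable[[Observation], str] = summarize_scene,
    actor: Callable[[str], str] = generate_action,
) -> str:
    """Compose the two stages; the text summary is the only bridge between them."""
    return actor(summarizer(obs))


if __name__ == "__main__":
    print(policy(Observation(["fridge", "table"])))
```

Keeping the summarizer and the action generator as separate callables mirrors the idea that text is the sole interface between the visual and action domains, so either stage could be swapped or fine-tuned independently under this assumed design.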

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition