HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

Xinyu Xu; Yizheng Zhang; Yong-Lu Li; Lei Han; Cewu Lu

arXiv:2406.19972 · cs.RO · November 14, 2024 · 2 cites

TL;DR

HumanVLA enables a humanoid robot to perform general object rearrangement tasks using vision and language, leveraging a teacher-student framework and a new dataset for training and evaluation.

Contribution

We introduce HumanVLA, a novel framework combining reinforcement learning and behavior cloning for vision-language guided object rearrangement by humanoids, supported by a new comprehensive dataset.

Findings

01 Effective in diverse rearrangement tasks

02 Outperforms baseline methods

03 Demonstrates generalization to unseen objects

Abstract

Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications. However, existing HSI techniques are limited to specific object dynamics and privileged information, which prevents the development of more comprehensive applications. To address this limitation, we introduce HumanVLA for general object rearrangement directed by practical vision and language. A teacher-student framework is utilized to develop HumanVLA. A state-based teacher policy is trained first using goal-conditioned reinforcement learning and adversarial motion prior. Then, it is distilled into a vision-language-action model via behavior cloning. We propose several key insights to facilitate the large-scale learning process. To support general object rearrangement by physical humanoid, we introduce a novel Human-in-the-Room dataset encompassing various rearrangement tasks. Through…
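The teacher-student step described in the abstract can be illustrated with a toy behavior-cloning loop. This is a minimal sketch under strong assumptions, not the authors' implementation: the teacher here is a fixed linear policy over a 3-D privileged state (the real teacher is trained with goal-conditioned reinforcement learning and an adversarial motion prior), the student is a linear policy over a transformed observation standing in for vision-language input, and the names `teacher_action`, `observe`, and `distill` are all hypothetical.

```python
import random

# Hypothetical stand-in for the state-based teacher: a fixed linear policy
# over a 3-D privileged state. In the paper the teacher is learned, not fixed.
TEACHER_WEIGHTS = [0.5, -1.0, 2.0]


def teacher_action(state):
    """Return the teacher's scalar action for a privileged state."""
    return sum(w * s for w, s in zip(TEACHER_WEIGHTS, state))


def observe(state):
    """Hypothetical stand-in for the student's non-privileged observation.

    The student never sees the raw state, only this transformed view.
    """
    s0, s1, s2 = state
    return [s0, s0 + s1, s2]


def distill(steps=5000, lr=0.1, seed=0):
    """Behavior cloning: regress the student's action onto the teacher's."""
    rng = random.Random(seed)
    student = [0.0, 0.0, 0.0]
    for _ in range(steps):
        state = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        obs = observe(state)
        target = teacher_action(state)  # label supplied by the teacher
        pred = sum(w * o for w, o in zip(student, obs))
        error = pred - target
        # Gradient of 0.5 * error**2 with respect to weight i is error * obs[i].
        student = [w - lr * error * o for w, o in zip(student, obs)]
    return student
```

Calling `distill()` returns student weights that imitate the teacher through the observation alone. In this toy setup the teacher's action is exactly representable by the student, so the imitation error shrinks toward zero; a real vision-language-action model would use a neural network and a far larger dataset.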

Code & Models

Repositories

AllenXuuu/HumanVLA
PyTorch · Official

Videos

HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition