Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Heng Li; Minghan Li; Zhi-Qi Cheng; Yifei Dong; Yuxuan Zhou; Jun-Yan He; Qi Dai; Teruko Mitamura; Alexander G. Hauptmann

arXiv:2406.19236 · cs.AI · November 5, 2024

TL;DR

This paper introduces Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN to include dynamic human activities and real-world scenarios, with new datasets, simulation tools, and navigation agents.

Contribution

The work presents the HA3D simulator, HA-R2R dataset, and novel navigation agents that incorporate human activity awareness, bridging the gap between simulation and real-world applications.

Findings

1. The HA3D simulator effectively models dynamic human activities.
2. The HA-R2R dataset extends R2R with human activity descriptions.
3. The proposed agents demonstrate improved navigation in human-rich environments.

Abstract

Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human…

Code & Models

Repositories

lpercc/ha3d_simulator (PyTorch, official)

Videos

Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings