$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

Peihao Chen; Xinyu Sun; Hongyan Zhi; Runhao Zeng; Thomas H. Li; Gaowen Liu; Mingkui Tan; Chuang Gan

arXiv:2308.07997·cs.CV·August 17, 2023·1 cites

TL;DR

This paper introduces $A^2$Nav, a zero-shot vision-and-language navigation method that leverages foundation models to understand complex instructions and execute navigation tasks without annotated data, surpassing some supervised methods.

Contribution

The paper proposes an action-aware zero-shot VLN framework that uses foundation models for instruction parsing and sub-task execution, enabling navigation without any path-instruction annotation data.

Findings

01 $A^2$Nav achieves promising zero-shot VLN performance.
02 It surpasses some supervised methods on the R2R-Habitat and RxR-Habitat datasets.
03 The approach effectively decomposes complex instructions into sub-tasks.
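To make the idea of decomposing an instruction into action-specific sub-tasks concrete, here is a minimal sketch. It is not the paper's method: the paper relies on foundation models for parsing, whereas this stand-in uses plain keyword matching. The sub-task labels (`leave_landmark`, `pass_landmark`, `reach_landmark`), the `SubTask` type, and the `decompose` function are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical mapping from action phrases to hypothetical sub-task labels.
# "proceed beyond" and "depart from" are the examples quoted in the abstract;
# "go to" is added here only for illustration.
ACTION_PHRASES = {
    "depart from": "leave_landmark",
    "proceed beyond": "pass_landmark",
    "go to": "reach_landmark",
}


@dataclass
class SubTask:
    action: str    # hypothetical sub-task label
    landmark: str  # landmark text the action refers to


def decompose(instruction: str) -> list[SubTask]:
    """Split an instruction into clauses and tag each with an action label.

    Keyword-based stand-in for a foundation-model parser (an assumption).
    """
    # Split on punctuation and simple connectives; empty pieces are dropped.
    clauses = re.split(
        r"\s*(?:,|\.|;|\band then\b|\bthen\b|\band\b)\s*", instruction.lower()
    )
    tasks = []
    for clause in filter(None, clauses):
        for phrase, action in ACTION_PHRASES.items():
            if clause.startswith(phrase):
                landmark = clause[len(phrase):].strip().removeprefix("the ")
                tasks.append(SubTask(action, landmark))
                break
        else:
            # No known action phrase: keep the clause but mark it unknown.
            tasks.append(SubTask("unknown", clause))
    return tasks


tasks = decompose(
    "Depart from the bedroom, then proceed beyond the sofa and go to the kitchen."
)
for task in tasks:
    print(task.action, "->", task.landmark)
```

Each resulting sub-task pairs one action with one landmark, so a downstream navigator could handle them one at a time. A keyword table like this breaks down quickly on free-form language, which is the gap a language model is meant to fill.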

Abstract

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"). How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Note that a well-educated human being can easily understand path instructions without the need for any special training. In this paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition