Correctable Landmark Discovery via Large Models for Vision-Language Navigation

Bingqian Lin; Yunshuang Nie; Ziming Wei; Yi Zhu; Hang Xu; Shikui Ma; Jianzhuang Liu; Xiaodan Liang

arXiv:2405.18721·cs.CV·June 6, 2024

TL;DR

This paper introduces CONSOLE, a VLN framework that uses two large models, ChatGPT and CLIP, for open-world landmark discovery, improving navigation accuracy, especially in unseen environments.

Contribution

The paper proposes a new VLN paradigm using large models for landmark discovery, introducing a correction mechanism and observation enhancement to improve modality alignment.

Findings

01 Achieves state-of-the-art results on R2R and R4R benchmarks.

02 Demonstrates significant improvements in unseen scenarios.

03 Validates the effectiveness of large models in open-world landmark discovery.

Abstract

Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is to align the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark co-occurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense…
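
The idea of pairing co-occurrence priors with embedding-based matching can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `score_view`, `discover`), the max-over-candidates scoring rule, and the three-dimensional toy vectors are all assumptions made for illustration. In a real system the priors would come from a language model such as ChatGPT and the vectors from CLIP's text and image encoders; here both are hard-coded so the example runs on its own.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_view(view_emb, landmark, priors, text_embs):
    """Score a candidate view for a landmark.

    Assumption: the score is the best similarity between the view and
    either the landmark itself or any object listed as co-occurring with it.
    """
    names = [landmark] + priors.get(landmark, [])
    return max(cosine(view_emb, text_embs[name]) for name in names)

def discover(landmark, views, priors, text_embs):
    """Return the name of the candidate view that best matches the landmark."""
    return max(
        views,
        key=lambda name: score_view(views[name], landmark, priors, text_embs),
    )

# Hypothetical toy data standing in for language-model priors and CLIP embeddings.
TEXT_EMBS = {
    "bathroom": [1.0, 0.0, 0.0],
    "sink": [0.9, 0.1, 0.0],
    "towel": [0.8, 0.0, 0.2],
}
PRIORS = {"bathroom": ["sink", "towel"]}  # landmark -> co-occurring objects
VIEWS = {
    "left": [0.0, 1.0, 0.1],    # toy image embedding of the left view
    "right": [0.85, 0.1, 0.1],  # toy image embedding of the right view
}

best_view = discover("bathroom", VIEWS, PRIORS, TEXT_EMBS)
print(best_view)
```

With these toy vectors the right-hand view is selected, because its embedding lies close to the "bathroom" text vector. The sketch omits the correction step named in the paper's title, which the text above does not describe in enough detail to reproduce.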

Code & Models

Repositories

expectorlin/console
PyTorch · Official

Taxonomy

Methods: ALIGN · Contrastive Language-Image Pre-training