TL;DR
This paper introduces CONSOLE, a Vision-Language Navigation (VLN) framework that uses two large models, ChatGPT and CLIP, for open-world landmark discovery, improving navigation accuracy, especially in unseen environments.
Contribution
The paper proposes a new VLN paradigm that uses large models for landmark discovery, and introduces a correction mechanism and observation enhancement to improve modality alignment.
Findings
Achieves state-of-the-art results on R2R and R4R benchmarks.
Demonstrates significant improvements in unseen scenarios.
Validates the effectiveness of large models in open-world landmark discovery.
Abstract
Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is to align the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark co-occurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense…
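The abstract is truncated here, so the exact discovery procedure is not shown. As a rough illustration of what matching landmarks to observations with CLIP-style embeddings can look like, the sketch below scores each candidate view by its highest cosine similarity to any landmark embedding and picks the best view. Everything in it is an assumption for illustration: the function names (`cosine`, `score_views`, `pick_view`) are hypothetical, the 3-dimensional vectors are toy stand-ins for real CLIP image and text embeddings, and the correction mechanism described in the paper is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_views(view_embs, landmark_embs):
    """Score each view by its best match against any landmark embedding."""
    return [max(cosine(v, l) for l in landmark_embs) for v in view_embs]


def pick_view(view_embs, landmark_embs):
    """Return the index of the view most similar to some landmark."""
    scores = score_views(view_embs, landmark_embs)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-ins (assumed, not real CLIP output): three candidate views
# and a single landmark text embedding.
views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
landmarks = [[0.0, 1.0, 0.0]]

best = pick_view(views, landmarks)
print(best, score_views(views, landmarks))
```

In a real pipeline the vectors would come from an image encoder and a text encoder that share an embedding space; the max-over-landmarks rule is only one plausible scoring choice.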
Taxonomy
Methods: ALIGN · Contrastive Language-Image Pre-training
