OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation

Ganlong Zhao; Guanbin Li; Weikai Chen; Yizhou Yu

arXiv:2403.17334·cs.CV·March 27, 2024·1 cites

TL;DR

OVER-NAV advances vision-and-language navigation by integrating open-vocabulary detection, large language models, and structured representations to improve generalization, memory utilization, and navigation accuracy in diverse environments.

Contribution

It introduces a novel framework combining open-vocabulary detection, LLMs, and structured omnigraph representations for enhanced IVLN performance.

Findings

01. Outperforms existing IVLN methods in diverse environments.

02. Enables generalization to unseen scenes without extra annotations.

03. Supports both discrete and continuous navigation environments.

Abstract

Recent advances in Iterative Vision-and-Language Navigation (IVLN) introduce a more meaningful and practical paradigm of VLN by maintaining the agent's memory across tours of scenes. Although long-term memory aligns better with the persistent nature of the VLN task, it raises the challenge of how to utilize highly unstructured navigation memory under extremely sparse supervision. To this end, we propose OVER-NAV, which aims to go over and beyond the current state of the art in IVLN. In particular, we propose to incorporate LLMs and open-vocabulary detectors to distill key information and establish correspondence between multi-modal signals. Such a mechanism introduces reliable cross-modal supervision and enables on-the-fly generalization to unseen scenes without the need for extra annotation or re-training. To fully exploit the interpreted navigation data, we further…
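
To make the idea of a structured, persistent navigation memory more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract above is truncated before it describes the structured representation, so the class `KeywordGraphMemory`, its methods, and the hop-limited query are all hypothetical names and design choices made for illustration. Keywords and detection scores are supplied by hand here; in the described pipeline they would presumably come from an LLM and an open-vocabulary detector.

```python
from collections import defaultdict, deque


class KeywordGraphMemory:
    """Hypothetical memory that persists across episodes in one scene.

    Viewpoints are graph nodes; detections are stored per node as
    {keyword: best confidence score seen so far}.
    """

    def __init__(self):
        self.edges = defaultdict(set)        # viewpoint -> neighbouring viewpoints
        self.detections = defaultdict(dict)  # viewpoint -> {keyword: score}

    def add_edge(self, a, b):
        # Navigability is treated as undirected in this sketch.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def record(self, viewpoint, keyword, score):
        # Keep only the highest score observed for a keyword at a viewpoint.
        best = self.detections[viewpoint].get(keyword, 0.0)
        self.detections[viewpoint][keyword] = max(best, score)

    def query(self, start, keyword, max_hops=2):
        """Breadth-first search from `start`; return {viewpoint: (hops, score)}
        for viewpoints within `max_hops` where `keyword` was detected."""
        hops_to = {start: 0}
        queue = deque([start])
        hits = {}
        while queue:
            node = queue.popleft()
            hops = hops_to[node]
            score = self.detections.get(node, {}).get(keyword)
            if score is not None:
                hits[node] = (hops, score)
            if hops < max_hops:
                for nxt in self.edges.get(node, ()):
                    if nxt not in hops_to:
                        hops_to[nxt] = hops + 1
                        queue.append(nxt)
        return hits


# Usage: an earlier episode recorded a sofa; a later instruction mentions it.
memory = KeywordGraphMemory()
memory.add_edge("v1", "v2")
memory.add_edge("v2", "v3")
memory.add_edge("v3", "v4")
memory.record("v3", "sofa", 0.8)
memory.record("v3", "sofa", 0.6)   # lower score does not overwrite 0.8
memory.record("v4", "sofa", 0.9)   # three hops from v1, outside the query radius
print(memory.query("v1", "sofa", max_hops=2))
```

The point of the sketch is only that detections gathered on earlier tours can be looked up by keyword around the agent's current position on later tours, without any scene-specific annotation.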

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques