Zero-Shot Vision-and-Language Navigation with Collision Mitigation in Continuous Environment

Seongjun Jeong; Gi-Cheon Kang; Joochan Kim; Byoung-Tak Zhang

arXiv:2410.17267 · cs.CV · October 24, 2024

TL;DR

This paper introduces VLN-CM, a zero-shot vision-and-language navigation system that leverages foundation models and collision mitigation techniques to improve navigation accuracy in continuous environments.

Contribution

The paper presents a novel zero-shot VLN framework with modules utilizing large language models and visual similarity for attention and collision avoidance, without task-specific training.

Findings

01. Outperforms baseline methods on VLN-CE validation data

02. Effective collision mitigation using occupancy masks

03. Utilizes foundation models for instruction parsing and scene understanding

Abstract

We propose the zero-shot Vision-and-Language Navigation with Collision Mitigation (VLN-CM), which takes these considerations into account. VLN-CM is composed of four modules and predicts the direction and distance of the next movement at each step. We utilize large foundation models for each module. To select the direction, we use the Attention Spot Predictor (ASP), View Selector (VS), and Progress Monitor (PM). The ASP employs a Large Language Model (e.g., ChatGPT) to split navigation instructions into attention spots, which are objects or scenes at the location to move to (e.g., a yellow door). The VS uses CLIP similarity to select, from panorama images provided at 30-degree intervals, the one that includes the attention spot. We then choose the angle of the selected image as the direction to move in. The PM uses a rule-based approach to decide which attention spot to focus on next, among multiple…
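
The View Selector step described in the abstract can be illustrated with a minimal sketch. It assumes the attention-spot text and each panorama view have already been embedded (for example by a CLIP text and image encoder), and it assumes that view index 0 corresponds to a heading of 0 degrees. The names `select_view`, `cosine_similarity`, and `VIEW_INTERVAL_DEG` are hypothetical and do not come from the paper; only the 30-degree spacing and the use of similarity to pick a view are taken from the abstract.

```python
import math
from typing import Sequence, Tuple

# Panorama spacing stated in the abstract: one view every 30 degrees.
VIEW_INTERVAL_DEG = 30


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_view(
    text_embedding: Sequence[float],
    view_embeddings: Sequence[Sequence[float]],
) -> Tuple[int, float]:
    """Return (heading in degrees, similarity) for the best-matching view.

    Assumption: view i was captured at heading i * VIEW_INTERVAL_DEG.
    """
    if not view_embeddings:
        raise ValueError("at least one view embedding is required")
    scores = [cosine_similarity(text_embedding, v) for v in view_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index * VIEW_INTERVAL_DEG, scores[best_index]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs:
    # only the fourth view (index 3) points the same way as the text.
    text = [1.0, 0.0]
    views = [[0.0, 1.0]] * 3 + [[1.0, 0.0]] + [[0.0, 1.0]] * 8
    heading, score = select_view(text, views)
    print(heading, score)
```

With real embeddings the toy vectors would be replaced by the encoder outputs for the current attention spot and the 12 panorama images. How the paper handles ties or low-similarity cases is not stated in the abstract, so this sketch simply takes the first maximum.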

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Focus