NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Jiazhao Zhang; Kunyu Wang; Rongtao Xu; Gengze Zhou; Yicong Hong; Xiaomeng Fang; Qi Wu; Zhizheng Zhang; He Wang

arXiv:2402.15852·cs.CV·July 2, 2024·2 cites

Open Access

TL;DR

NaVid introduces a video-based vision-language model that enables agents to navigate using only monocular video streams, achieving state-of-the-art results and better generalization in unseen environments without relying on maps or depth sensors.

Contribution

NaVid is the first VLM to perform VLN tasks using only real-time video input, improving generalization and reducing reliance on traditional navigation aids.

Findings

01. NaVid achieves state-of-the-art navigation performance in simulation and real-world environments.

02. NaVid demonstrates superior cross-dataset and Sim2Real transfer capabilities.

03. NaVid effectively encodes spatio-temporal context from video streams for navigation.

Abstract

Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometers, or depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or…

Taxonomy

Topics: Historical Geography and Cartography · Constraint Satisfaction and Optimization