Learning Vision-and-Language Navigation from YouTube Videos

Kunyang Lin; Peihao Chen; Diwei Huang; Thomas H. Li; Mingkui Tan; Chuang Gan

arXiv:2307.11984·cs.CV·July 25, 2023

TL;DR

This paper introduces a novel approach to vision-and-language navigation by leveraging large-scale YouTube house tour videos to create a new dataset, enabling better generalization and state-of-the-art results in navigation tasks.

Contribution

It proposes a method to automatically construct path-instruction pairs from unlabeled videos and pre-trains an agent on this data, addressing limitations of small datasets and improving generalization.

Findings

01

Achieves state-of-the-art results on the R2R and REVERIE benchmarks.

02

Demonstrates effective use of real-world YouTube videos for VLN.

03

Introduces a trajectory judgment pretext task for layout knowledge mining.
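
This summary does not describe how the trajectory judgment task is set up. As a rough illustration only, one plausible way to pose such a task is to ask a model whether the frames of a trajectory appear in their original order, using shuffled copies as negatives. The sketch below builds labeled samples under that assumption; the function name `make_judgment_samples`, its parameters, and the shuffling scheme are hypothetical and are not taken from the paper or its repository.

```python
import random
from typing import List, Sequence, Tuple


def make_judgment_samples(
    trajectory: Sequence[str],
    num_negatives: int = 1,
    seed: int = 0,
) -> List[Tuple[List[str], int]]:
    """Build (frame_sequence, label) pairs for an order-judgment task.

    Label 1 marks the original frame order; label 0 marks a shuffled copy.
    This is an assumed formulation, not the authors' implementation.
    """
    if len(set(trajectory)) < 2:
        raise ValueError("need at least two distinct frames to build a negative")

    rng = random.Random(seed)  # local RNG keeps the sampling reproducible
    original = list(trajectory)
    samples: List[Tuple[List[str], int]] = [(original, 1)]

    for _ in range(num_negatives):
        shuffled = original[:]
        # Reshuffle until the order actually differs from the original.
        while shuffled == original:
            rng.shuffle(shuffled)
        samples.append((shuffled, 0))
    return samples


if __name__ == "__main__":
    frames = ["frame_0", "frame_1", "frame_2", "frame_3"]
    for sequence, label in make_judgment_samples(frames, num_negatives=2):
        print(label, sequence)
```

A classifier trained on such pairs would have to pick up on which room-to-room transitions are plausible, which is one reading of "layout knowledge mining"; the actual training objective may differ.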

Abstract

Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instruction datasets, limiting the generalization to unseen environments. There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information. However, these videos have not been explored for VLN before. In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it. To achieve this, we have to tackle the challenges of automatically constructing path-instruction pairs and exploiting real layout knowledge from raw and unlabeled videos. To address these, we first leverage an…

Code & Models

Repositories

jeremylinky/youtube-vln
PyTorch · Official

Videos

Learning Vision-and-Language Navigation from YouTube Videos · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition