Learning Vision-and-Language Navigation from YouTube Videos
Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

TL;DR
This paper introduces a novel approach to vision-and-language navigation by leveraging large-scale YouTube house tour videos to create a new dataset, enabling better generalization and state-of-the-art results in navigation tasks.
Contribution
It proposes a method to automatically construct path-instruction pairs from unlabeled videos and pre-trains an agent on this data, addressing limitations of small datasets and improving generalization.
Findings
Achieves state-of-the-art results on the R2R and REVERIE benchmarks.
Demonstrates effective use of real-world YouTube videos for VLN.
Introduces a trajectory judgment pretext task for layout knowledge mining.
Abstract
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instruction datasets, limiting the generalization to unseen environments. There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information. However, these videos have not been explored for VLN before. In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it. To achieve this, we have to tackle the challenges of automatically constructing path-instruction pairs and exploiting real layout knowledge from raw and unlabeled videos. To address these, we first leverage an…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
