Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation

Chia-Wen Kuo; Chih-Yao Ma; Judy Hoffman; Zsolt Kira

arXiv:2211.11116 · cs.CV · November 22, 2022

Open Access

TL;DR

This paper introduces structure-encoding auxiliary tasks (SEA) that pre-train image encoders on navigation-environment data, yielding visual representations that improve performance in Vision-and-Language Navigation.

Contribution

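The paper proposes three auxiliary tasks (3D jigsaw, traversability prediction, and instance classification) for pre-training image encoders on navigation-environment data, addressing the distribution shift between ImageNet images and navigation views and improving navigation success rates.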

Findings

01 SEA pre-trained features encode scene structure better than ImageNet pre-trained features.

02 Success rates improve on Test-Unseen environments.

03 SEA features are plug-and-play with existing VLN agents without tuning.

Abstract

In Vision-and-Language Navigation (VLN), researchers typically take an image encoder pre-trained on ImageNet without fine-tuning on the environments that the agent will be trained or tested on. However, the distribution shift between the training images from ImageNet and the views in the navigation environments may render the ImageNet pre-trained image encoder suboptimal. Therefore, in this paper, we design a set of structure-encoding auxiliary tasks (SEA) that leverage the data in the navigation environments to pre-train and improve the image encoder. Specifically, we design and customize (1) 3D jigsaw, (2) traversability prediction, and (3) instance classification to pre-train the image encoder. Through rigorous ablations, our SEA pre-trained features are shown to better encode structural information of the scenes, which ImageNet pre-trained features fail to properly encode but is…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
