Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Sinan Tan; Mengmeng Ge; Di Guo; Huaping Liu; Fuchun Sun

arXiv:2201.10788·cs.CV·January 27, 2022·6 cites

TL;DR

This paper introduces a self-supervised learning framework that encodes 3D semantic information into representations to improve vision-and-language navigation, outperforming most RGB-based methods.

Contribution

It develops a novel self-supervised training method for 3D semantic encoding and integrates it with an LSTM-based navigation model for enhanced performance.

Findings

01

Achieves 68% success on validation unseen split

02

Achieves 66% success on test unseen split

03

Outperforms most RGB-based vision-language methods

Abstract

In the Vision-and-Language Navigation task, the embodied agent follows linguistic instructions and navigates to a specific goal. It is important in many practical scenarios and has attracted extensive attention from both computer vision and robotics communities. However, most existing works only use RGB images but neglect the 3D semantic information of the scene. To this end, we develop a novel self-supervised training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation. Specifically, a region query task is designed as the pretext task, which predicts the presence or absence of objects of a particular class in a specific 3D region. Then, we construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs. Experiments show that the proposed approach achieves…
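
The region query pretext task in the abstract asks whether objects of a given class are present in a given 3D region. A minimal sketch of how such a binary label could be derived from a voxel-level semantic reconstruction follows. The sparse dictionary representation, the axis-aligned box, the function name `region_query_label`, and the class ids are all assumptions made for illustration; the abstract does not specify how regions are sampled or how the encoder is trained.

```python
from typing import Dict, Tuple

Voxel = Tuple[int, int, int]


def region_query_label(
    voxels: Dict[Voxel, int], lo: Voxel, hi: Voxel, class_id: int
) -> int:
    """Return 1 if any voxel inside the half-open box [lo, hi) has class_id, else 0.

    `voxels` maps occupied voxel coordinates to a semantic class id
    (an assumed sparse layout, not taken from the paper).
    """
    for (x, y, z), label in voxels.items():
        inside = (
            lo[0] <= x < hi[0]
            and lo[1] <= y < hi[1]
            and lo[2] <= z < hi[2]
        )
        if inside and label == class_id:
            return 1
    return 0


# Toy scene with hypothetical class ids (3 = "chair", 7 = "table").
scene = {(1, 2, 0): 3, (5, 5, 1): 7}

# Query: is there a chair in the box spanning (0, 0, 0) to (4, 4, 2)?
label = region_query_label(scene, (0, 0, 0), (4, 4, 2), class_id=3)
```

In a self-supervised setup, labels like this come for free from the reconstruction itself, so a model could be trained to answer (region, class) queries from the learned 3D semantic representation without manual annotation.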

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · WordPiece · Weight Decay · Softmax · Dense Connections · Linear Warmup With Linear Decay