Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation
Sinan Tan, Mengmeng Ge, Di Guo, Huaping Liu, Fuchun Sun

TL;DR
This paper introduces a self-supervised learning framework that encodes 3D semantic information into representations to improve vision-and-language navigation, outperforming most RGB-based methods.
Contribution
It develops a novel self-supervised training method for 3D semantic encoding and integrates it with an LSTM-based navigation model for enhanced performance.
Findings
Achieves a 68% success rate on the validation unseen split
Achieves a 66% success rate on the test unseen split
Outperforms most RGB-based vision-language methods
Abstract
In the Vision-and-Language Navigation task, the embodied agent follows linguistic instructions and navigates to a specific goal. It is important in many practical scenarios and has attracted extensive attention from both computer vision and robotics communities. However, most existing works only use RGB images but neglect the 3D semantic information of the scene. To this end, we develop a novel self-supervised training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation. Specifically, a region query task is designed as the pretext task, which predicts the presence or absence of objects of a particular class in a specific 3D region. Then, we construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs. Experiments show that the proposed approach achieves…
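The region query pretext task described in the abstract can be illustrated with a minimal sketch of how its binary training labels might be generated. This is not the authors' implementation. The sparse dictionary layout for the voxel grid, the axis-aligned box regions, the uniform sampling scheme, and the names `region_query_label` and `sample_queries` are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]
# Hypothetical region format: (min corner inclusive, max corner exclusive).
Region = Tuple[Voxel, Voxel]


def region_query_label(voxels: Dict[Voxel, int], region: Region, class_id: int) -> int:
    """Return 1 if any voxel of class_id lies inside the region, else 0."""
    lo, hi = region
    for pos, cls in voxels.items():
        if cls == class_id and all(lo[i] <= pos[i] < hi[i] for i in range(3)):
            return 1
    return 0


def sample_queries(
    voxels: Dict[Voxel, int],
    grid_size: int,
    region_size: int,
    num_classes: int,
    n: int,
    seed: int = 0,
) -> List[Tuple[Region, int, int]]:
    """Draw n random (region, class_id, label) triples from a cubic grid.

    Assumed sampling scheme: uniform box position and uniform class id.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        lo = tuple(rng.randint(0, grid_size - region_size) for _ in range(3))
        hi = tuple(c + region_size for c in lo)
        class_id = rng.randrange(num_classes)
        label = region_query_label(voxels, (lo, hi), class_id)
        samples.append(((lo, hi), class_id, label))
    return samples
```

In a setup like this, an encoder would take the voxel grid and be trained to predict each label from the (region, class) query, so the supervision comes from the reconstruction itself rather than from manual annotation.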
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · WordPiece · Weight Decay · Softmax · Dense Connections · Linear Warmup With Linear Decay
