Vision-and-Language Navigation Generative Pretrained Transformer
Wen Hanlin

TL;DR
VLN-GPT introduces a decoder-only transformer (GPT-2) approach to vision-and-language navigation that simplifies the model architecture, improves efficiency, and outperforms existing encoder-based models on VLN tasks.
Contribution
The paper presents a novel transformer decoder model for VLN that eliminates the need for explicit historical encoding, combining pre-training and fine-tuning for enhanced performance.
Findings
VLN-GPT surpasses state-of-the-art encoder-based models.
The model achieves higher navigation accuracy.
The model handles trajectories efficiently without explicit history encoding.
Abstract
In the Vision-and-Language Navigation (VLN) field, agents are tasked with navigating real-world scenes guided by linguistic instructions. Enabling the agent to adhere to instructions throughout the process of navigation represents a significant challenge within the domain of VLN. To address this challenge, common approaches often rely on encoders to explicitly record past locations and actions, increasing model complexity and resource consumption. Our proposal, the Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT), adopts a transformer decoder model (GPT2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules. This method allows for direct historical information access through trajectory sequence, enhancing efficiency. Furthermore, our model separates the training process into offline pre-training with imitation learning…
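The abstract's point that history is reachable "through trajectory sequence" can be illustrated with a small sketch. It is not the paper's implementation: the token layout (interleaved observation and action tokens) and the helper names `build_trajectory_sequence`, `causal_mask`, and `visible_history` are assumptions made for illustration. The sketch only shows that, under a causal (decoder-style) attention mask, the position for the current observation can already see every earlier observation and action, so no separate history module is required.

```python
from typing import List, Tuple

Token = Tuple[str, int, str]  # (kind, time step, payload); hypothetical layout


def build_trajectory_sequence(observations: List[str], actions: List[str]) -> List[Token]:
    """Interleave observations and actions as o_0, a_0, o_1, a_1, ... (assumed layout)."""
    tokens: List[Token] = []
    for t, (obs, act) in enumerate(zip(observations, actions)):
        tokens.append(("obs", t, obs))
        tokens.append(("act", t, act))
    return tokens


def causal_mask(n: int) -> List[List[bool]]:
    """Row i marks which positions j the token at position i may attend to (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_history(tokens: List[Token], mask: List[List[bool]], position: int) -> List[Token]:
    """Return the tokens that the given position can attend to under the mask."""
    return [tok for j, tok in enumerate(tokens) if mask[position][j]]


# Toy trajectory with three steps (made-up values).
observations = ["hallway", "kitchen", "door"]
actions = ["forward", "left", "stop"]

tokens = build_trajectory_sequence(observations, actions)
mask = causal_mask(len(tokens))

# Position 4 holds the observation at step 2; the action for step 2 is not yet visible.
history = visible_history(tokens, mask, position=4)
for tok in history:
    print(tok)
```

In a real decoder the mask would gate attention weights over learned embeddings rather than select tuples, but the visibility pattern is the same: the full past trajectory sits in the context, and future tokens are hidden.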
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
