Vision-and-Language Navigation Generative Pretrained Transformer

Wen Hanlin

arXiv:2405.16994·cs.AI·May 28, 2024

TL;DR

VLN-GPT is a transformer-decoder approach to vision-and-language navigation that simplifies the model architecture, improves efficiency, and outperforms existing encoder-based models on VLN tasks.

Contribution

The paper presents a transformer decoder model (GPT-2) for VLN that removes the need for an explicit history-encoding module and splits training into a pre-training stage and a fine-tuning stage.

Findings

01

VLN-GPT surpasses state-of-the-art encoder-based models.

02

The model achieves higher navigation accuracy.

03

Efficient trajectory modeling without explicit history encoding.

Abstract

In the Vision-and-Language Navigation (VLN) field, agents are tasked with navigating real-world scenes guided by linguistic instructions. Enabling the agent to adhere to instructions throughout navigation is a significant challenge in VLN. To address this challenge, common approaches often rely on encoders to explicitly record past locations and actions, increasing model complexity and resource consumption. Our proposal, the Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT), adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules. This method allows direct access to historical information through the trajectory sequence, enhancing efficiency. Furthermore, our model separates the training process into offline pre-training with imitation learning…
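The idea of reading history directly from the trajectory sequence can be illustrated with causal self-attention, the masking scheme used by decoder models such as GPT-2. The sketch below is not the paper's implementation: the function name `causal_attention_weights`, the four-step trajectory, and the uniform scores are all assumptions chosen for illustration. It shows only that, under a causal mask, step t distributes its attention over steps 0 to t and gives zero weight to future steps, so no separate history module is required.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax with a causal mask: step t attends to steps 0..t only."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                      # drop future steps
        peak = max(visible)                         # subtract max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Pad with zeros so future positions receive no attention at all.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# Hypothetical raw attention scores for a 4-step trajectory (not from the paper).
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)

for t, row in enumerate(weights):
    print(t, [round(w, 3) for w in row])
```

With uniform scores, each step spreads its attention evenly over itself and all earlier steps, which makes the lower-triangular structure easy to see in the printed rows.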

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections