Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation

Yibo Cui; Liang Xie; Yakun Zhang; Meishan Zhang; Ye Yan; Erwei Yin

arXiv:2308.12587·cs.CV·August 25, 2023

TL;DR

This paper introduces a novel pre-training paradigm called GELA for Vision-and-Language Navigation, focusing on fine-grained entity-landmark alignment to improve navigation accuracy.

Contribution

The paper contributes grounded entity-landmark human annotations for R2R (GEL-R2R) and three adaptive pre-training objectives that supervise cross-modal alignment at the entity level in VLN tasks.

Findings

01. Achieves state-of-the-art results on the R2R and CVDN benchmarks.

02. Demonstrates improved fine-grained entity-landmark alignment.

03. Validates the effectiveness and generalizability of the GELA approach.

Abstract

Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or single sub-instruction to the corresponding trajectory. However, another critical problem of achieving fine-grained alignment at the entity level is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To achieve the adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, named GEL-R2R. Additionally, we adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity…
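
The second and third objectives named in the abstract (landmark bounding box prediction and entity-landmark semantic alignment) can be illustrated with a minimal sketch. The abstract does not specify the loss functions, so everything here is an assumption: the box objective is shown as an L1 term plus an IoU term, and the alignment objective as an InfoNCE-style contrastive loss. The function names (`box_iou`, `box_loss`, `contrastive_loss`) and the `temperature` parameter are hypothetical and are not taken from the authors' repository.

```python
import math

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def box_loss(pred, target):
    """Assumed box objective: mean L1 distance plus (1 - IoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / 4.0
    return l1 + (1.0 - box_iou(pred, target))

def contrastive_loss(entities, landmarks, temperature=0.1):
    """Assumed alignment objective (InfoNCE-style).

    entities[i] and landmarks[i] are a matched pair of embedding vectors;
    every other landmark in the batch serves as a negative for entity i.
    """
    total = 0.0
    for i, e in enumerate(entities):
        logits = [sum(x * y for x, y in zip(e, l)) / temperature
                  for l in landmarks]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(entities)

# Toy data: two entity phrases and their matching landmark embeddings.
entity_vecs = [(1.0, 0.0), (0.0, 1.0)]
aligned = [(1.0, 0.0), (0.0, 1.0)]
swapped = [(0.0, 1.0), (1.0, 0.0)]
aligned_loss = contrastive_loss(entity_vecs, aligned)
swapped_loss = contrastive_loss(entity_vecs, swapped)
```

In this toy setup the contrastive loss is lower when each entity embedding points at its own landmark than when the pairs are swapped, which is the behavior an entity-level alignment objective is meant to reward. The first objective (entity phrase prediction) is not covered by this sketch.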

Code & Models

Repositories

csir1996/vln-gela (PyTorch, official)

Videos

Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques