Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Shizhe Chen; Pierre-Louis Guhur; Makarand Tapaswi; Cordelia Schmid; Ivan Laptev

arXiv:2202.11742 · cs.CV · February 25, 2022

TL;DR

This paper introduces DUET, a dual-scale graph transformer for vision-and-language navigation. It combines coarse-scale reasoning over a global topological map with fine-scale grounding of language in local observations, outperforming state-of-the-art methods on REVERIE and SOON and improving success rate on R2R.

Contribution

The paper presents a novel dual-scale graph transformer architecture that jointly models global exploration and local language grounding for navigation tasks.
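One way to picture the joint modelling described above is as a gated blend of two score sets over candidate nodes, one from the coarse global map and one from the fine local view. The sketch below is illustrative only and is not taken from the paper or its repository: the names `fuse_scores` and `gate_logit`, the scalar gate, and the choice to score nodes missing from one scale as 0.0 are all assumptions.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(coarse, fine, gate_logit):
    """Blend two per-node score dicts with a scalar gate (hypothetical helper).

    coarse: scores for nodes on the global map.
    fine:   scores for nodes visible from the current location.
    Nodes missing from one dict contribute 0.0 for that scale (an assumption).
    """
    g = sigmoid(gate_logit)
    nodes = set(coarse) | set(fine)
    return {
        n: g * coarse.get(n, 0.0) + (1.0 - g) * fine.get(n, 0.0)
        for n in nodes
    }


# Made-up scores: "C" is only reachable through the global map,
# while "A" and "D" are also visible locally.
coarse = {"A": 0.2, "C": 1.5, "D": 0.1}
fine = {"A": 0.4, "D": 2.0}

fused = fuse_scores(coarse, fine, gate_logit=0.0)  # gate = 0.5
best = max(fused, key=fused.get)
print(best)
```

With the gate at 0.5 the locally well-grounded node wins; pushing `gate_logit` upward shifts the decision toward the global map, which is the kind of balance the contribution statement refers to.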

Findings

01 Outperforms state-of-the-art on REVERIE and SOON benchmarks

02 Improves success rate on R2R benchmark

03 Effectively balances global exploration and local grounding

Abstract

Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground languages in visual scenes, but also should explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in global action space. To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers. The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON. It also improves…
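The idea of building a topological map on the fly and acting in a global action space can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TopoMap` class, its method names, and the use of unweighted breadth-first search for path planning are all hypothetical choices made for the example.

```python
from collections import deque


class TopoMap:
    """Toy topological map grown as the agent moves (hypothetical helper)."""

    def __init__(self):
        self.edges = {}       # node -> set of neighbouring nodes
        self.visited = set()  # nodes the agent has actually stood on

    def observe(self, current, navigable):
        """Mark `current` as visited and link it to the nodes seen from it."""
        self.visited.add(current)
        self.edges.setdefault(current, set())
        for n in navigable:
            self.edges[current].add(n)
            self.edges.setdefault(n, set()).add(current)

    def global_actions(self, current):
        """Every mapped node except the current one is a candidate target."""
        return sorted(n for n in self.edges if n != current)

    def frontier(self):
        """Nodes that have been observed but not yet visited."""
        return sorted(n for n in self.edges if n not in self.visited)

    def shortest_path(self, start, goal):
        """Breadth-first search over the edges observed so far."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(self.edges[node]):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None  # goal not reachable on the current map


m = TopoMap()
m.observe("A", ["B", "C"])  # standing at A, the agent sees B and C
m.observe("B", ["A", "D"])  # after moving to B, it sees A and D

actions = m.global_actions("B")
path = m.shortest_path("B", "C")
print(actions)  # ['A', 'C', 'D']
print(path)     # ['B', 'A', 'C']
```

The point of the global action space is visible in the last two lines: from B the agent can pick C, a node it saw earlier but never visited, and the map supplies the route back through A instead of restricting the choice to B's immediate neighbours.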

Code & Models

Repositories

cshizhe/vln-duet
pytorch

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Laplacian EigenMap · Adam · Label Smoothing · Dropout