TL;DR
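This paper introduces DUET, a dual-scale graph transformer for vision-and-language navigation (VLN). It combines a coarse-scale encoding of a global topological map, used for exploration, with a fine-scale encoding of local observations, used for language grounding, and significantly outperforms prior state-of-the-art methods on the REVERIE and SOON benchmarks.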
Contribution
The paper presents a novel dual-scale graph transformer architecture that jointly models global exploration and local language grounding for navigation tasks.
Findings
Significantly outperforms state-of-the-art methods on the goal-oriented VLN benchmarks REVERIE and SOON
Improves success rate on the R2R benchmark
Effectively balances global exploration and local grounding
Abstract
Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground languages in visual scenes, but also should explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in global action space. To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers. The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON. It also improves…
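The two ideas in the abstract, a topological map grown on the fly and a dynamic blend of coarse-scale and fine-scale scores, can be illustrated with a small sketch. This is not the authors' implementation: the names `TopoMap`, `fuse_scores`, `gate_logit`, and `backtrack_score` are hypothetical, and the scalar sigmoid gate is an assumption, since the abstract only says the two encodings are "dynamically" combined. In a real model the scores and the gate would come from the graph transformers rather than being passed in by hand.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TopoMap:
    """Toy topological map grown as the agent moves (hypothetical structure)."""
    visited: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)  # node -> set of neighbouring nodes

    def update(self, current, navigable):
        """Record the current node and the neighbours observed from it."""
        self.visited.add(current)
        self.edges.setdefault(current, set())
        for n in navigable:
            self.edges[current].add(n)
            self.edges.setdefault(n, set()).add(current)

    def global_actions(self):
        """Global action space: every observed-but-unvisited node, sorted."""
        return sorted(n for n in self.edges if n not in self.visited)


def fuse_scores(coarse, fine, gate_logit, backtrack_score):
    """Blend per-node scores from the two scales with a scalar gate.

    coarse: dict node -> score over the global action space.
    fine:   dict node -> score for nodes adjacent to the current position.
    Nodes absent from `fine` receive `backtrack_score` (an assumption made
    here so that both scales cover the same set of candidate nodes).
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid, in (0, 1)
    return {
        node: gate * c + (1.0 - gate) * fine.get(node, backtrack_score)
        for node, c in coarse.items()
    }


# Usage: the agent starts at A, sees B and C, moves to B, and sees D.
topo = TopoMap()
topo.update("A", ["B", "C"])
topo.update("B", ["A", "D"])
actions = topo.global_actions()  # ['C', 'D']

coarse = {"C": 0.2, "D": 1.0}    # made-up scores over the whole map
fine = {"D": 3.0}                # made-up scores; only D is adjacent to B
fused = fuse_scores(coarse, fine, gate_logit=0.0, backtrack_score=-1.0)
best = max(fused, key=fused.get)  # node the agent would navigate to next
```

Because the candidate set is every unvisited node on the map rather than only the current neighbours, the agent in this sketch can choose C even while standing at B, which is the sense in which a global action space supports backtracking and long-range exploration.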
Taxonomy
Methods
Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Laplacian EigenMap · Adam · Label Smoothing · Dropout
