TL;DR
This paper introduces a local slot attention mechanism for vision-and-language navigation (VLN). A slot-attention module uses object segmentation to preserve the integrity of objects across views, and a local attention mask restricts visual attention to spatially nearby views. The authors report state-of-the-art results on the R2R dataset.
Contribution
The paper proposes a new slot-attention module and local attention mask for VLN, enhancing object integrity and spatial focus in transformer-based models.
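The summary does not describe the internals of the proposed slot-attention module, so the following is only a minimal sketch of one generic slot-attention iteration, not the authors' implementation. It assumes identity projections (no learned query/key/value weights) and replaces the usual recurrent slot update with a plain weighted mean; the function name `slot_attention_step`, the feature sizes, and the number of slots are all hypothetical.

```python
import numpy as np

def slot_attention_step(inputs, slots, eps=1e-8):
    """One simplified slot-attention iteration (illustrative assumption).

    inputs: (N, D) visual features; slots: (K, D) slot vectors.
    Returns the updated slots (K, D) and the soft assignment (N, K).
    """
    d = inputs.shape[-1]
    logits = inputs @ slots.T / np.sqrt(d)            # (N, K) similarity
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over slots: slots compete for each input
    # Normalize over inputs so each slot becomes a weighted mean of its inputs.
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    return weights.T @ inputs, attn

rng = np.random.default_rng(0)
features = rng.normal(size=(36, 8))   # hypothetical: 36 view features of width 8
slots = rng.normal(size=(4, 8))       # hypothetical: 4 slots
for _ in range(3):                    # a few refinement iterations
    slots, assignment = slot_attention_step(features, slots)
```

The softmax runs over the slot axis rather than the input axis, which is what makes slots compete for inputs and lets features belonging to the same object gather in one slot.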
Findings
Achieved state-of-the-art results on the R2R dataset.
Improved integration of object segmentation information.
Reduced noise by restricting visual attention span.
Abstract
Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location by following natural language instructions in unfamiliar environments. Recently, transformer-based models have achieved significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, there are two problems in current transformer-based models. 1) The models process each view independently without taking the integrity of the objects into account. 2) During the self-attention operation in the visual modality, views that are spatially distant can be interwoven with each other without explicit…
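The second problem, spatially distant views mixing during visual self-attention, is the kind of issue a local attention mask addresses. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes a panorama laid out as 12 headings by 3 elevations (36 views, row-major by elevation), a hypothetical neighborhood `radius`, and single-head attention with no learned projections. The names `local_attention_mask` and `masked_self_attention` are made up for this example.

```python
import numpy as np

def local_attention_mask(n_headings=12, n_elevations=3, radius=1):
    """Boolean (V, V) mask; True where view i may attend to view j."""
    idx = np.arange(n_headings * n_elevations)
    elev, head = np.divmod(idx, n_headings)       # assumed row-major layout
    dh = np.abs(head[:, None] - head[None, :])
    dh = np.minimum(dh, n_headings - dh)          # headings wrap around the panorama
    de = np.abs(elev[:, None] - elev[None, :])    # elevations do not wrap
    return (dh <= radius) & (de <= radius)

def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention with masked pairs excluded."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # disallowed pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # each row keeps its diagonal, so max is finite
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
mask = local_attention_mask()
views = rng.normal(size=(36, 8))                  # hypothetical view features
out, weights = masked_self_attention(views, mask)
```

With `radius=1`, a view at the middle elevation attends to a 3 by 3 neighborhood of views, and views at the top or bottom elevation attend to a 2 by 3 neighborhood, so attention weights for all other views are exactly zero.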
Taxonomy
Methods: Balanced Selection
