Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation
Ting Liu, Yue Hu, Wansen Wu, Youkai Wang, Kai Xu, Quanjun Yin

TL;DR
This paper introduces PANDA, a pretraining framework for vision-and-language navigation (VLN) that uses prompt-based tuning to make pretrained models aware of indoor environments and of sequence-level semantics, which improves navigation performance.
Contribution
PANDA uses a two-stage prompting approach that makes pretrained models more sensitive to indoor environments and to the contextual relations between actions in VLN tasks. The authors present this as a novel adaptation strategy.
Findings
PANDA outperforms existing methods on the R2R and REVERIE datasets.
Indoor-aware prompts improve sample efficiency in VLN.
Context prompts enhance understanding of instruction sequences.
Abstract
Pretrained visual-language models have extensive world knowledge and are widely used in visual and language navigation (VLN). However, they are not sensitive to indoor scenarios for VLN tasks. Another challenge for VLN is how the agent understands the contextual relations between actions on a path and performs cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset, in order to augment pretrained models with inductive biases towards indoor environments. This can enable more sample-efficient adaptation for VLN agents. Furthermore, in the context-aware stage, we design a set of hard context prompts to capture the sequence-level…
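The "deep visual prompts" mentioned in the abstract can be illustrated with a generic sketch of deep prompt tuning: each layer of a frozen backbone receives its own small set of learnable prompt vectors, prepended to the image patch tokens, and only those prompts would be trained. The abstract does not give PANDA's exact formulation, so everything below is an assumption made for illustration: the function names, the toy `frozen_layer` that stands in for a transformer layer, and the choice to discard prompt outputs between layers.

```python
import random


def frozen_layer(tokens):
    """Toy stand-in for a frozen backbone layer (hypothetical).

    Mixes every token with the mean of the whole sequence, so prepended
    prompts influence the patch tokens. It has no trainable weights.
    """
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def make_prompts(num_layers, num_prompts, dim, seed=0):
    """Create one set of prompt vectors per layer (the only trainable part)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]
        for _ in range(num_layers)
    ]


def forward_with_deep_prompts(patch_tokens, prompts):
    """Run patch tokens through the frozen layers with per-layer prompts."""
    tokens = patch_tokens
    for layer_prompts in prompts:
        # Prepend this layer's own prompts to the current patch tokens.
        out = frozen_layer(layer_prompts + tokens)
        # Drop the prompt outputs so the next layer gets fresh prompts.
        tokens = out[len(layer_prompts):]
    return tokens


def count_trainable(prompts):
    """Number of scalar values that would be updated during tuning."""
    return sum(len(vec) for layer in prompts for vec in layer)


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = make_prompts(num_layers=4, num_prompts=2, dim=2)
features = forward_with_deep_prompts(patches, prompts)
```

In this sketch the output has one feature vector per input patch, and the trainable state is limited to layers × prompts × dimension values. Keeping the tuned state that small relative to a frozen backbone is the usual motivation for prompt tuning and matches the abstract's description of an "efficient tuning paradigm".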
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
