DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation
Yinfeng Yu, Dongsheng Yang

TL;DR
The paper introduces DOPE, a novel network that enhances vision-and-language navigation by better extracting instruction details and modeling object relationships across modalities, leading to improved navigation accuracy.
Contribution
The paper proposes a dual object perception-enhancement network with text and image modules to better utilize instruction details and object relationships in VLN tasks.
Findings
Improved navigation accuracy on R2R and REVERIE datasets.
Enhanced understanding of object relationships across modalities.
Effective extraction of essential instruction phrases.
Abstract
Vision-and-Language Navigation (VLN) is a challenging task where an agent must understand language instructions and navigate unfamiliar environments using visual cues. The agent must accurately locate the target based on visual information from the environment and complete tasks through interaction with the surroundings. Despite significant advancements in this field, two major limitations persist: (1) Many existing methods input complete language instructions directly into multi-layer Transformer networks without fully exploiting the detailed information within the instructions, thereby limiting the agent's language understanding capabilities during task execution; (2) Current approaches often overlook the modeling of object relationships across different modalities, failing to effectively utilize latent clues between objects, which affects the accuracy and robustness of navigation…
Taxonomy
Methods: Linear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax
