DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation

Yinfeng Yu; Dongsheng Yang

arXiv:2505.00743 · cs.CV · May 5, 2025

TL;DR

The paper introduces DOPE, a novel network that enhances vision-and-language navigation by better extracting instruction details and modeling object relationships across modalities, leading to improved navigation accuracy.

Contribution

The paper proposes a dual object perception-enhancement network with text and image modules to better utilize instruction details and object relationships in VLN tasks.

Findings

01. Improved navigation accuracy on the R2R and REVERIE datasets.

02. Enhanced understanding of object relationships across modalities.

03. Effective extraction of essential instruction phrases.

Abstract

Vision-and-Language Navigation (VLN) is a challenging task where an agent must understand language instructions and navigate unfamiliar environments using visual cues. The agent must accurately locate the target based on visual information from the environment and complete tasks through interaction with the surroundings. Despite significant advancements in this field, two major limitations persist: (1) Many existing methods input complete language instructions directly into multi-layer Transformer networks without fully exploiting the detailed information within the instructions, thereby limiting the agent's language understanding capabilities during task execution; (2) Current approaches often overlook the modeling of object relationships across different modalities, failing to effectively utilize latent clues between objects, which affects the accuracy and robustness of navigation…

Taxonomy

Methods: Linear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax