Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations

Yibo Cui; Liang Xie; Yu Zhao; Jiawei Sun; Erwei Yin

arXiv:2506.08566 · cs.CV · June 11, 2025

Open Access

TL;DR

This paper introduces FCA-NIG, a generative framework that automatically creates vision-language navigation instructions annotated with fine-grained sub-instruction and entity-landmark alignments. Training on these annotations improves agent performance.

Contribution

The paper presents FCA-NIG, the first large-scale dataset augmentation method for fine-grained cross-modal alignments in VLN, enhancing training data quality without manual annotation.

Findings

1. Training with FCA-R2R improves VLN agent performance.
2. Sub-instruction alignment increases state awareness and decision accuracy.
3. Entity-landmark alignment enhances navigation performance and generalization.

Abstract

Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments critical for accurate navigation action decision-making. To address this limitation, we propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations. In this framework, an augmented trajectory is first divided into sub-trajectories, which are then processed through GLIP-based landmark detection, crafted instruction construction, OFA-Speaker based R2R-like instruction generation, and CLIP-powered entity selection,…
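
The abstract is cut off before the pipeline is fully described, so the following is only a sketch of how dual-level annotations might be represented as data, not the authors' implementation. The splitting rule (fixed-size chunks that share a boundary viewpoint), the names `SubInstruction`, `EntityLandmark`, `split_trajectory`, and `annotate`, and the `describe` callback (a stand-in for the GLIP, OFA-Speaker, and CLIP stages) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class EntityLandmark:
    """Entity-level alignment: an entity phrase tied to one viewpoint."""
    entity: str           # phrase that appears in the sub-instruction
    viewpoint_index: int  # index into the full trajectory (hypothetical field)


@dataclass
class SubInstruction:
    """Sub-instruction-level alignment: text tied to a span of the trajectory."""
    text: str
    start: int  # inclusive index of the first viewpoint
    end: int    # inclusive index of the last viewpoint
    entities: List[EntityLandmark] = field(default_factory=list)


def split_trajectory(num_viewpoints: int, max_len: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) spans covering the trajectory.

    Assumed rule, not taken from the paper: each span holds at most
    `max_len` viewpoints and consecutive spans share their boundary viewpoint.
    """
    if max_len < 2:
        raise ValueError("max_len must be at least 2")
    if num_viewpoints <= 0:
        return []
    if num_viewpoints == 1:
        return [(0, 0)]
    spans = []
    start, last = 0, num_viewpoints - 1
    while start < last:
        end = min(start + max_len - 1, last)
        spans.append((start, end))
        start = end  # the next span begins where this one ended
    return spans


# Hypothetical callback: given a sub-trajectory, return its instruction text
# and a list of (entity phrase, offset within the sub-trajectory) pairs.
Describe = Callable[[Sequence[str]], Tuple[str, List[Tuple[str, int]]]]


def annotate(viewpoints: Sequence[str], describe: Describe,
             max_len: int = 3) -> List[SubInstruction]:
    """Attach sub-instruction and entity-landmark alignments to a trajectory."""
    annotated = []
    for start, end in split_trajectory(len(viewpoints), max_len):
        text, found = describe(viewpoints[start:end + 1])
        # Convert offsets local to the sub-trajectory into global indices.
        entities = [EntityLandmark(name, start + offset) for name, offset in found]
        annotated.append(SubInstruction(text, start, end, entities))
    return annotated
```

Keeping the model stages behind a single callback means the alignment bookkeeping can be checked with a dummy function before any detector or speaker model is plugged in.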

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling
