Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
Yibo Cui, Liang Xie, Yu Zhao, Jiawei Sun, Erwei Yin

TL;DR
This paper introduces FCA-NIG, a generative framework that automatically creates fine-grained, annotated vision-language navigation instructions, significantly improving agent performance by providing detailed sub-instruction and entity-landmark alignments.
Contribution
The paper presents FCA-NIG, the first large-scale dataset augmentation method for fine-grained cross-modal alignments in VLN, enhancing training data quality without manual annotation.
Findings
Training with FCA-R2R, the dataset generated by FCA-NIG, improves VLN agent performance.
Sub-instruction alignment increases state awareness and decision accuracy.
Entity-landmark alignment enhances navigation performance and generalization.
Abstract
Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments critical for accurate navigation action decision-making. To address this limitation, we propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations. In this framework, an augmented trajectory is first divided into sub-trajectories, which are then processed through GLIP-based landmark detection, crafted instruction construction, OFA-Speaker based R2R-like instruction generation, and CLIP-powered entity selection,…
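The stage order described in the abstract can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: the abstract is truncated, so the fixed-size chunking rule, the `SubInstruction` record, the `annotate` function, and the four `toy_*` stand-ins are all hypothetical. In a real system the `detect`, `speak`, and `select` callables would wrap GLIP, the OFA-Speaker, and CLIP respectively; no calls to those models are shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SubInstruction:
    """One sub-trajectory with its sub-instruction and aligned entities (hypothetical record)."""
    viewpoints: List[str]  # viewpoint ids in this sub-trajectory
    text: str              # generated sub-instruction
    entities: List[str]    # entity phrases matched to detected landmarks


def split_trajectory(viewpoints: Sequence[str], size: int) -> List[List[str]]:
    """Assumption: split into consecutive chunks of at most `size` viewpoints."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(viewpoints[i:i + size]) for i in range(0, len(viewpoints), size)]


def annotate(
    viewpoints: Sequence[str],
    size: int,
    detect: Callable[[List[str]], List[str]],        # landmark detection stage
    craft: Callable[[List[str], List[str]], str],    # crafted instruction construction
    speak: Callable[[List[str], str], str],          # R2R-like instruction generation
    select: Callable[[str, List[str]], List[str]],   # entity selection stage
) -> List[SubInstruction]:
    """Run the four stages on each sub-trajectory, in the order the abstract lists them."""
    results = []
    for sub in split_trajectory(viewpoints, size):
        landmarks = detect(sub)
        draft = craft(sub, landmarks)
        text = speak(sub, draft)
        entities = select(text, landmarks)
        results.append(SubInstruction(sub, text, entities))
    return results


# Toy stand-ins so the sketch runs without any model; all four are hypothetical.
def toy_detect(sub: List[str]) -> List[str]:
    return [f"landmark@{v}" for v in sub]


def toy_craft(sub: List[str], landmarks: List[str]) -> str:
    return "go past " + ", ".join(landmarks)


def toy_speak(sub: List[str], draft: str) -> str:
    return draft.capitalize() + "."


def toy_select(text: str, landmarks: List[str]) -> List[str]:
    return [name for name in landmarks if name in text]


pairs = annotate(["a", "b", "c"], 2, toy_detect, toy_craft, toy_speak, toy_select)
for pair in pairs:
    print(pair.viewpoints, "|", pair.text, "|", pair.entities)
```

Passing the stages in as callables keeps the control flow independent of any particular detector or speaker, so each stand-in can be swapped for a real model wrapper without changing `annotate`.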
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling
