Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation
Haitian Zeng, Xiaohan Wang, Wenguan Wang, Yi Yang

TL;DR
Kefa is a speaker model for navigation instruction generation that enriches feature representations with external knowledge and tightens temporal alignment between instructions and observations, achieving state-of-the-art results on the R2R and UrbanWalk datasets.
Contribution
Introduces a Knowledge Refinement Module and an Adaptive Temporal Alignment method for improved navigation instruction generation.
Findings
Achieves state-of-the-art instruction generation performance on the R2R and UrbanWalk datasets.
Proposes the SPICE-D metric, which accounts for the correctness of direction phrases when evaluating instructions.
Improves instruction quality in both indoor and outdoor scenes.
Abstract
We introduce a novel speaker model, Kefa, for navigation instruction generation. Existing speaker models in Vision-and-Language Navigation suffer from a large domain gap in vision features between different environments and from insufficient temporal grounding capability. To address these challenges, we propose a Knowledge Refinement Module that enhances the feature representation with external knowledge facts, and an Adaptive Temporal Alignment method that enforces fine-grained alignment between the generated instructions and the observation sequences. Moreover, we propose a new metric, SPICE-D, for navigation instruction evaluation, which is aware of the correctness of direction phrases. Experimental results on the R2R and UrbanWalk datasets show that the proposed Kefa speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Attentive Walk-Aggregating Graph Neural Network
