Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation

Haitian Zeng; Xiaohan Wang; Wenguan Wang; Yi Yang

arXiv:2307.13368·cs.CV·July 26, 2023

TL;DR

Kefa is a speaker model for navigation instruction generation that enriches visual feature representations with external knowledge facts and enforces fine-grained temporal alignment between generated instructions and observation sequences, achieving state-of-the-art results on the R2R and UrbanWalk datasets.

Contribution

Introduces a Knowledge Refinement Module and an Adaptive Temporal Alignment method for improved navigation instruction generation.

Findings

1. Achieves state-of-the-art instruction generation performance on the R2R and UrbanWalk datasets.

2. Proposes the SPICE-D metric, which accounts for the correctness of direction phrases when evaluating instructions.

3. Improves instruction quality in both indoor and outdoor scenes.

Abstract

We introduce a novel speaker model, Kefa, for navigation instruction generation. Existing speaker models in Vision-and-Language Navigation suffer from a large domain gap in vision features between different environments and from insufficient temporal grounding capability. To address these challenges, we propose a Knowledge Refinement Module that enhances the feature representation with external knowledge facts, and an Adaptive Temporal Alignment method that enforces fine-grained alignment between the generated instructions and the observation sequences. Moreover, we propose a new metric, SPICE-D, for navigation instruction evaluation, which is aware of the correctness of direction phrases. Experimental results on the R2R and UrbanWalk datasets show that the proposed Kefa speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.

Code & Models

Repositories

haitianzeng/KEFA
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Attentive Walk-Aggregating Graph Neural Network