Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation

Muraleekrishna Gopinathan; Martin Masek; Jumana Abu-Khalaf; David Suter

arXiv:2409.05583·cs.CL·September 10, 2024

TL;DR

This paper introduces SAS, a spatially-aware speaker model that generates detailed, diverse navigation instructions for embodied robots by leveraging environmental knowledge and adversarial training to improve instruction quality.

Contribution

The paper presents SAS, a novel instruction generator that incorporates structural and semantic environment knowledge and uses adversarial reward learning to enhance instruction diversity and quality.

Findings

01. Outperforms existing models on standard metrics
02. Produces more detailed and varied navigation instructions
03. Utilizes environment knowledge for instruction generation

Abstract

Embodied AI aims to develop robots that can understand and execute human language instructions, as well as communicate in natural languages. On this front, we study the task of generating highly detailed navigational instructions for the embodied robots to follow. Although recent studies have demonstrated significant leaps in the generation of step-by-step instructions from sequences of images, the generated instructions lack variety in terms of their referral to objects and landmarks. Existing speaker models learn strategies to evade the evaluation metrics and obtain higher scores even for low-quality sentences. In this work, we propose SAS (Spatially-Aware Speaker), an instruction generator or Speaker model that utilises both structural and semantic knowledge of the environment to produce richer instructions. For training, we employ a reward learning method in an…

Code & Models

Repositories

gmuraleekrishna/sas (PyTorch, official)

Videos

Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Robotics and Automated Systems