Less is More: Generating Grounded Navigation Instructions from Landmarks

Su Wang; Ceslee Montgomery; Jordi Orbay; Vighnesh Birodkar; Aleksandra Faust; Izzeddin Gur; Natasha Jaques; Austin Waters; Jason Baldridge; Peter Anderson

arXiv:2111.12872 · cs.CV · April 6, 2022 · 1 citation

Open Access

TL;DR

This paper introduces MARKY-MT5, a system that generates accurate, grounded, multilingual navigation instructions from indoor panoramic images, significantly improving over prior methods by focusing on visual landmarks and leveraging large-scale annotated data.

Contribution

The paper presents a novel multimodal, multilingual, multitask encoder-decoder system for grounded navigation instruction generation, trained on a large-scale, weakly supervised landmark dataset.

Findings

01. Human wayfinders reach a 71% success rate on Room-to-Room when following MARKY-MT5's instructions, close to the 75% they reach with human-written instructions.

02. Grounded landmark annotations improve instruction quality and grounding accuracy.

03. Enables multilingual instruction generation with high success rates across languages.

Abstract

We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first stage landmark detector and a second stage generator -- a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8b images, we identify 971k English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR…
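The success rate (SR) quoted above can be illustrated with a minimal sketch. It assumes the customary Room-to-Room convention that an episode counts as a success when the wayfinder stops within 3 m of the goal, and it uses straight-line distance as a simplified stand-in for distance along the navigation graph. The function and constant names are hypothetical and are not taken from the paper.

```python
from math import dist

# Assumption: 3 m is the success threshold customarily used for Room-to-Room.
SUCCESS_RADIUS_M = 3.0


def success_rate(final_positions, goal_positions, radius=SUCCESS_RADIUS_M):
    """Return the fraction of episodes that end within `radius` metres of the goal.

    Positions are coordinate tuples in metres. Straight-line distance is a
    simplification; an actual evaluation would typically measure distance
    along the navigation graph.
    """
    if len(final_positions) != len(goal_positions):
        raise ValueError("each episode needs one final position and one goal")
    if not final_positions:
        raise ValueError("at least one episode is required")
    successes = sum(
        dist(final, goal) <= radius
        for final, goal in zip(final_positions, goal_positions)
    )
    return successes / len(final_positions)


if __name__ == "__main__":
    # Two hypothetical episodes: the first stops 1 m from its goal, the second 10 m away.
    finals = [(0.0, 0.0), (10.0, 0.0)]
    goals = [(1.0, 0.0), (0.0, 0.0)]
    print(success_rate(finals, goals))
```

Under this definition, an SR of 71% means that 71 of every 100 followed instructions ended within the success radius.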

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition