Less is More: Generating Grounded Navigation Instructions from Landmarks
Su Wang, Ceslee Montgomery, Jordi Orbay, Vighnesh Birodkar, Aleksandra Faust, Izzeddin Gur, Natasha Jaques, Austin Waters, Jason Baldridge, Peter Anderson

TL;DR
This paper introduces MARKY-MT5, a system that generates accurate, grounded, multilingual navigation instructions from indoor panoramic images, significantly improving over prior methods by focusing on visual landmarks and leveraging large-scale annotated data.
Contribution
The paper presents a novel multimodal, multilingual, multitask encoder-decoder system for grounded navigation instruction generation, trained on a large-scale, weakly supervised landmark dataset.
Findings
Human wayfinders following MARKY-MT5's instructions reach a 71% success rate (SR) on Room-to-Room, just shy of the 75% SR reported as the comparison point.
Grounded landmark annotations improve instruction quality and grounding accuracy.
Enables multilingual instruction generation with high success rates across languages.
Abstract
We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator -- a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 971k English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR…
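The success rate (SR) quoted in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code. It assumes the 3 m stopping threshold commonly used for Room-to-Room, and it substitutes straight-line distance for the path distance a real evaluation would compute on the navigation graph. The names Episode, is_success, success_rate and SUCCESS_RADIUS_M are hypothetical.

```python
from dataclasses import dataclass
from math import dist

# Assumption: 3 m is the success threshold commonly used for Room-to-Room.
SUCCESS_RADIUS_M = 3.0


@dataclass
class Episode:
    """One wayfinding attempt (hypothetical container, not from the paper)."""
    stop_xyz: tuple[float, float, float]  # where the wayfinder stopped
    goal_xyz: tuple[float, float, float]  # end of the reference route


def is_success(ep: Episode, radius: float = SUCCESS_RADIUS_M) -> bool:
    # Simplification: Euclidean distance stands in for distance along
    # the navigation graph.
    return dist(ep.stop_xyz, ep.goal_xyz) <= radius


def success_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes that end within the success radius of the goal."""
    if not episodes:
        raise ValueError("success_rate needs at least one episode")
    return sum(is_success(ep) for ep in episodes) / len(episodes)
```

Under these assumptions, an SR of 71% means that 71 of every 100 wayfinding attempts ended within the success radius of the goal.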
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
