Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn, James M. Rehg

TL;DR
This paper introduces LED-Bert, a transformer-based model with a pretraining strategy for Localization via Embodied Dialog (LED), and shows that a graph-based scene representation is more effective for localization than the top-down 2D maps used in prior work.
Contribution
The paper proposes the LED-Bert architecture together with a pretraining strategy, and shows that a graph-based scene representation is more effective than the top-down 2D maps used in prior work.
Findings
LED-Bert outperforms previous baselines
Graph-based scene representation is more effective than 2D maps
Pretraining strategy enhances localization performance
Abstract
We address the challenging task of Localization via Embodied Dialog (LED). Given a dialog from two agents, an Observer navigating through an unknown environment and a Locator who is attempting to identify the Observer's location, the goal is to predict the Observer's final location in a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a graph-based scene representation is more effective than the top-down 2D maps used in prior works. Our approach outperforms previous baselines.
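The task framing in the abstract (predict the Observer's final location from a dialog, over a graph-based scene representation) can be illustrated with a minimal sketch. This is not the LED-Bert model. It assumes, purely for illustration, that the scene is a set of graph nodes with fixed feature vectors, that the dialog has already been encoded into a vector of the same size, and that nodes are scored by a dot product. The names `localize`, `node_feats`, and `dialog_vec` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def localize(dialog_vec, node_feats):
    """Score every graph node against the dialog embedding.

    Hypothetical stand-in for a learned model: the score is a plain
    dot product between the dialog vector and each node's features.
    Returns the best node id and a probability for every node.
    """
    node_ids = list(node_feats)
    scores = [
        sum(d * f for d, f in zip(dialog_vec, node_feats[n]))
        for n in node_ids
    ]
    probs = softmax(scores)
    best = max(range(len(node_ids)), key=lambda i: probs[i])
    return node_ids[best], dict(zip(node_ids, probs))


# Toy scene graph: three nodes with made-up 3-d features.
node_feats = {
    "kitchen": [1.0, 0.0, 0.0],
    "hallway": [0.0, 1.0, 0.0],
    "bedroom": [0.0, 0.0, 1.0],
}
# Made-up dialog embedding that leans toward the third feature.
dialog_vec = [0.2, 0.1, 0.9]

pred, probs = localize(dialog_vec, node_feats)
print(pred)
```

Framing localization as a choice among discrete graph nodes, rather than among pixels of a top-down map, is what makes the prediction a simple classification over nodes in this sketch. How the paper itself encodes the dialog and the nodes is not described in this summary.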
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Social Robot Interaction and HRI
