Transformer-based Localization from Embodied Dialog with Large-scale Pre-training

Meera Hahn; James M. Rehg

arXiv:2210.04864·cs.CV·October 11, 2022·1 cites

Open Access

TL;DR

This paper introduces LED-Bert, a transformer-based model for Localization via Embodied Dialog (LED), together with a pretraining strategy. It shows that a graph-based scene representation is more effective than the top-down 2D maps used in prior work, and that the approach outperforms previous baselines.

Contribution

The paper proposes a new LED-Bert architecture with a pretraining strategy and highlights the effectiveness of graph-based scene representations over 2D maps.

Findings

01 LED-Bert outperforms previous baselines

02 Graph-based scene representation is more effective than 2D maps

03 Pretraining strategy enhances localization performance

Abstract

We address the challenging task of Localization via Embodied Dialog (LED). Given a dialog from two agents, an Observer navigating through an unknown environment and a Locator who is attempting to identify the Observer's location, the goal is to predict the Observer's final location in a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a graph-based scene representation is more effective than the top-down 2D maps used in prior works. Our approach outperforms previous baselines.
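
To make the task framing concrete, the following is a toy sketch of localization posed as node selection over a graph-based scene representation. It is not the LED-Bert model: the node names, the hand-written feature vectors, and the cosine-similarity scoring are all assumptions made for illustration, standing in for learned dialog and scene encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_node(dialog_vec, node_feats):
    """Return the id of the graph node whose features best match the dialog vector."""
    return max(node_feats, key=lambda node: cosine(dialog_vec, node_feats[node]))


# Hypothetical scene graph: each node is a candidate location with a feature vector.
node_feats = {
    "kitchen": [1.0, 0.0, 0.2],
    "hallway": [0.1, 1.0, 0.0],
    "bedroom": [0.0, 0.2, 1.0],
}

# Hypothetical embedding of the Observer/Locator dialog.
dialog_vec = [0.9, 0.1, 0.1]

predicted = predict_node(dialog_vec, node_feats)
print(predicted)
```

In this framing the output is a discrete node rather than a pixel on a top-down map, which is the distinction the abstract draws between the two scene representations.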

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Social Robot Interaction and HRI