TL;DR
This paper presents "Talk The Walk", a large-scale grounded dialogue dataset in which two agents, a guide and a tourist, communicate in natural language so that the tourist can navigate to a target location in New York City. It also introduces a new grounding mechanism (MASC) and establishes baseline results.
Contribution
The paper contributes a new dataset and task for grounded dialogue navigation, the Masked Attention for Spatial Convolutions (MASC) grounding mechanism for tourist localization, and baseline results on the full task.
Findings
MASC yields significant improvements in tourist localization for both emergent and natural language communication
The dataset enables research on dialogue grounded in action and perception in a real-world environment
Baselines built with MASC are non-trivial on the full task, whose complete solution remains an open problem
Abstract
We introduce "Talk The Walk", the first large-scale dialogue dataset grounded in action and perception. The task involves two agents (a "guide" and a "tourist") that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location. The task and dataset, which are described in detail, are challenging and their full solution is an open problem that we pose to the community. We (i) focus on the task of tourist localization and develop the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding tourist utterances into the guide's map, (ii) show it yields significant improvements for both emergent and natural language communication, and (iii) using this method, we establish non-trivial baselines on the full task.
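To give a rough intuition for what "masked attention for spatial convolutions" could look like, the sketch below applies a softmax-normalized 3x3 mask to a small grid so that a value on the map moves toward the attended neighbor. This is only an illustration under stated assumptions: the abstract does not specify how MASC is computed, and the function names (`softmax`, `masked_shift`), the scalar grid, the 3x3 mask size, and the hand-written logits (standing in for something predicted from a tourist utterance) are all hypothetical, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of numbers."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_shift(grid, mask_logits):
    """Spread each grid value to its 3x3 neighborhood, weighted by an attention mask.

    Assumption for illustration: mask_logits has 9 entries in row-major order,
    and mask[dy + 1][dx + 1] is the weight with which the value at (y, x)
    moves to (y + dy, x + dx). Mass that would leave the grid is dropped.
    """
    h, w = len(grid), len(grid[0])
    flat = softmax(mask_logits)
    mask = [flat[0:3], flat[3:6], flat[6:9]]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] += mask[dy + 1][dx + 1] * grid[y][x]
    return out


# Hypothetical example: a 4x4 map with all belief mass at row 1, column 1.
grid = [[0.0] * 4 for _ in range(4)]
grid[1][1] = 1.0

# Hand-written logits that attend almost entirely to "one column to the right"
# (row-major index 5 corresponds to dy = 0, dx = +1).
logits = [0.0] * 9
logits[5] = 20.0

shifted = masked_shift(grid, logits)
```

With these inputs, nearly all of the mass ends up at row 1, column 2, which is the sense in which an attention mask over a small spatial kernel can encode a movement on a map. In a learned model the logits would be produced from an utterance or action representation rather than written by hand.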
