Grounding Complex Navigational Instructions Using Scene Graphs

Michiel de Jong; Satyapriya Krishna; Anuva Agarwal

arXiv:2106.01607·cs.LG·June 4, 2021

TL;DR

This paper introduces a new dataset of complex navigation instructions paired with scene graphs, enabling training of reinforcement learning agents to understand and execute natural language commands in visual environments.

Contribution

It adapts the CLEVR dataset to generate complex instructions and scene graphs, and demonstrates training an agent in VizDoom to follow these instructions.
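One way to picture the supervision this provides: once a scene graph is available, the target of an instruction can be resolved programmatically, so an environment can tell when the agent has reached it. The sketch below is illustrative only. The `SceneObject` fields, the `left` relation encoding, the success radius, and all helper names are assumptions loosely modeled on CLEVR-style scenes, not the paper's data format.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    # Hypothetical attribute set, loosely modeled on CLEVR-style scenes.
    color: str
    shape: str
    x: float
    y: float


@dataclass
class SceneGraph:
    objects: list
    # relations["left"][i] lists the indices of objects to the left of object i.
    relations: dict = field(default_factory=dict)


def build_relations(objects):
    """Derive a simple 'left' relation from x coordinates (an assumption)."""
    left = [[j for j, o in enumerate(objects) if o.x < obj.x] for obj in objects]
    return {"left": left}


def matches(obj, color=None, shape=None):
    """True if the object has the requested attributes (None means 'any')."""
    return (color is None or obj.color == color) and (
        shape is None or obj.shape == shape
    )


def resolve_target(graph, target, relation=None, anchor=None):
    """Indices of objects matching `target` that stand in `relation`
    to at least one object matching `anchor`."""
    candidates = [i for i, o in enumerate(graph.objects) if matches(o, **target)]
    if relation is None:
        return candidates
    anchors = [i for i, o in enumerate(graph.objects) if matches(o, **anchor)]
    return [
        i
        for i in candidates
        if any(i in graph.relations[relation][a] for a in anchors)
    ]


def instruction_done(graph, agent_xy, target_indices, radius=0.5):
    """Supervision signal: is the agent within `radius` of any target object?"""
    ax, ay = agent_xy
    return any(
        (graph.objects[i].x - ax) ** 2 + (graph.objects[i].y - ay) ** 2
        <= radius ** 2
        for i in target_indices
    )


# "Go to the red cube left of the green sphere."
objects = [
    SceneObject("red", "cube", 0.0, 0.0),
    SceneObject("red", "cube", 4.0, 0.0),
    SceneObject("green", "sphere", 2.0, 0.0),
]
graph = SceneGraph(objects, build_relations(objects))
targets = resolve_target(
    graph,
    target={"color": "red", "shape": "cube"},
    relation="left",
    anchor={"color": "green", "shape": "sphere"},
)
```

Because the check depends only on object attributes and positions, the same graph could in principle be mapped onto any environment that can place those objects, which is the sense in which such a dataset is environment-agnostic.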

Findings

01. Successfully generated a supervised dataset for navigation tasks

02. Trained an agent to interpret complex language instructions in VizDoom

03. Showed the feasibility of scene graph-based instruction grounding

Abstract

Training a reinforcement learning agent to carry out natural language instructions is limited by the available supervision, i.e., knowing when the instruction has been carried out. We adapt the CLEVR visual question answering dataset to generate complex natural language navigation instructions and accompanying scene graphs, yielding an environment-agnostic supervised dataset. To demonstrate the use of this dataset, we map the scenes to the VizDoom environment and use the gated-attention architecture of Chaplot et al. (2018) to train an agent to carry out these more complex language instructions.
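The cited architecture is a gated-attention model. As that mechanism is usually described, the instruction embedding is projected to one sigmoid gate per image-feature channel, and each channel is scaled by its gate before the policy sees it. The following is a minimal pure-Python sketch of that fusion step under those assumptions. The toy shapes, weights, and function names are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def attention_gate(instruction_embedding, weights, bias):
    """One sigmoid gate per feature-map channel, computed from the
    instruction embedding by a linear layer (weights: channels x embed_dim)."""
    return [
        sigmoid(sum(w * e for w, e in zip(row, instruction_embedding)) + b)
        for row, b in zip(weights, bias)
    ]


def gated_attention(feature_maps, gate):
    """Scale every spatial position of channel c by gate[c], i.e. an
    element-wise product with the gate broadcast over height and width."""
    return [
        [[gate_c * v for v in row] for row in channel]
        for channel, gate_c in zip(feature_maps, gate)
    ]


# Toy example: 2 channels of 2x2 image features, 3-dimensional embedding.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
embedding = [1.0, 0.0, -1.0]
weights = [[0.0, 0.0, 0.0], [10.0, 0.0, 10.0]]  # hypothetical learned weights
bias = [0.0, -20.0]

gate = attention_gate(embedding, weights, bias)
fused = gated_attention(feature_maps, gate)
```

In this toy setup the first channel is halved and the second is almost entirely suppressed, which illustrates how an instruction can select which visual features reach the policy.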

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition