VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator

Ayush Shrivastava; Karthik Gopalakrishnan; Yang Liu; Robinson Piramuthu; Gokhan Tür; Devi Parikh; Dilek Hakkani-Tür

arXiv:2105.11589 · cs.CV · March 17, 2022 · 1 citation

TL;DR

VISITRON is a multi-modal Transformer-based robot navigator designed for interactive vision-and-dialog tasks, effectively leveraging dialogue and visual semantics to improve navigation and interaction decisions in photo-realistic environments.

Contribution

The paper introduces a Transformer-based model that aligns object-level visual semantics with dialogue history and learns when to interact versus navigate, advancing interactive robot navigation.

Findings

01. Achieves state-of-the-art SPL on the CVDN benchmark.

02. Learns to identify when to interact, which naturally generalizes the game-play mode introduced by Roman et al.

03. Achieves competitive results on the static CVDN leaderboard.
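
This page does not define SPL. Assuming it refers to the standard "Success weighted by Path Length" metric used in embodied navigation, a minimal sketch of the computation follows. The function name `spl` and the episode tuple layout are illustrative assumptions, not taken from the VISITRON code.

```python
def spl(episodes):
    """Success weighted by Path Length, averaged over episodes.

    Each episode is a hypothetical tuple (success, shortest_len, taken_len):
      success      -- True if the agent stopped at the goal
      shortest_len -- length of the shortest path from start to goal
      taken_len    -- length of the path the agent actually followed
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, taken_len in episodes:
        if not success:
            continue  # failed episodes contribute zero
        denom = max(taken_len, shortest_len)
        # An episode that starts at the goal has zero length; count it as 1.
        total += 1.0 if denom == 0 else shortest_len / denom
    return total / len(episodes)


# Toy data: one optimal success, one success with a 2x detour, one failure.
toy_episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
toy_score = spl(toy_episodes)
```

Under this definition a successful episode is penalized in proportion to how much longer the agent's path is than the shortest one, so the metric rewards efficiency as well as success.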

Abstract

Interactive robots navigating photo-realistic environments need to be trained to effectively leverage and handle the dynamic nature of dialogue in addition to the challenges underlying vision-and-language navigation (VLN). In this paper, we present VISITRON, a multi-modal Transformer-based navigator better suited to the interactive regime inherent to Cooperative Vision-and-Dialog Navigation (CVDN). VISITRON is trained to: i) identify and associate object-level concepts and semantics between the environment and dialogue history, ii) identify when to interact vs. navigate via imitation learning of a binary classification head. We perform extensive pre-training and fine-tuning ablations with VISITRON to gain empirical insights and improve performance on CVDN. VISITRON's ability to identify when to interact leads to a natural generalization of the game-play mode introduced by Roman et al.…
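
The second training objective in the abstract, a binary classification head trained by imitation to decide when to interact versus navigate, can be illustrated with a minimal sketch. It assumes the navigator's state has already been encoded as a small feature vector and that expert demonstrations label each step with 1 (ask a question) or 0 (keep navigating). The feature vectors, function names, and the plain logistic head are hypothetical stand-ins, not the architecture from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def interact_probability(features, weights, bias):
    # Linear head on top of a (hypothetical) encoded state vector.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def bce_loss(prob, label, eps=1e-12):
    # Binary cross-entropy against the expert's interact/navigate label.
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def train_head(examples, dim, lr=0.5, epochs=200):
    # Imitation learning here is plain supervised SGD on expert labels.
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            p = interact_probability(features, weights, bias)
            grad = p - label  # gradient of BCE with respect to the logit
            weights = [w - lr * grad * x for w, x in zip(weights, features)]
            bias -= lr * grad
    return weights, bias


# Toy demonstrations: (state features, 1 = expert asked, 0 = expert navigated).
demos = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
weights, bias = train_head(demos, dim=2)
```

At inference time, such a head would be thresholded (for example at 0.5) at each step to choose between asking a question and taking a navigation action.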

Code & Models

Repositories

alexa/visitron (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques