VTNet: Visual Transformer Network for Object Goal Navigation

Heming Du; Xin Yu; Liang Zheng

arXiv:2105.09447 · cs.CV · May 21, 2021 · 36 cites

TL;DR

This paper introduces VTNet, a visual transformer network that enhances object goal navigation by capturing relationships and spatial cues among objects, leading to improved navigation performance in unseen environments.

Contribution

The paper proposes a visual transformer architecture that encodes spatial and relational information about the observed objects, together with a pre-training scheme that associates these visual representations with navigation signals.

Findings

1. VTNet outperforms existing methods in AI2-Thor environments.

2. The spatial-aware descriptors improve navigation decision-making.

3. Pre-training enhances the association between visual features and navigation signals.

Abstract

Object goal navigation aims to steer an agent towards a target object based on the agent's observations. Designing effective visual representations of the observed scene is pivotal to determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representations in navigation. VTNet is a highly effective structure that embodies two key properties for visual representations: first, the relationships among all the object instances in a scene are exploited; second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we develop a pre-training scheme to associate the visual representations with navigation signals, and thus facilitate navigation policy learning. In a nutshell, VTNet embeds object and region features with…
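The two properties named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names (`spatial_descriptor`, `self_attention`), the choice to append a normalized box center and size to each appearance feature, and the use of single-head attention with identity projections are all assumptions made for illustration. A real model would use learned projections and multiple heads.

```python
import math

def spatial_descriptor(feature, box, image_size):
    """Hypothetical helper: append normalized box center and size to a feature."""
    x1, y1, x2, y2 = box
    width, height = image_size
    location = [
        (x1 + x2) / (2 * width),   # normalized center x
        (y1 + y2) / (2 * height),  # normalized center y
        (x2 - x1) / width,         # normalized box width
        (y2 - y1) / height,        # normalized box height
    ]
    return list(feature) + location

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(descriptors):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the descriptors themselves
    (identity projections), so every object attends to every other object.
    """
    dim = len(descriptors[0])
    outputs = []
    for query in descriptors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in descriptors
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, descriptors))
            for i in range(dim)
        ])
    return outputs

# Made-up detections: (appearance feature, bounding box in pixels).
objects = [
    ([0.9, 0.1], (10, 20, 110, 120)),
    ([0.2, 0.8], (200, 50, 280, 150)),
    ([0.5, 0.5], (0, 0, 300, 300)),
]
descriptors = [spatial_descriptor(f, b, (300, 300)) for f, b in objects]
attended = self_attention(descriptors)
```

Each row of `attended` is a weighted mix of all object descriptors, so the output for one object carries both appearance and location information from the others. That is the sense in which attention over location-augmented features can expose relational and directional cues to a downstream policy.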

Videos

VTNet: Visual Transformer Network for Object Goal Navigation · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Residual Connection · Adam · Layer Normalization · Dense Connections