VTNet: Visual Transformer Network for Object Goal Navigation
Heming Du, Xin Yu, Liang Zheng

TL;DR
This paper introduces VTNet, a visual transformer network that enhances object goal navigation by capturing relationships and spatial cues among objects, leading to improved navigation performance in unseen environments.
Contribution
The paper proposes a visual transformer architecture that encodes spatial and relational information about objects and image regions, together with a pre-training scheme that associates these visual representations with navigation signals.
Findings
VTNet outperforms existing methods in AI2-Thor environments.
The spatial-aware descriptors improve navigation decision-making.
Pre-training enhances the association between visual features and navigation signals.
Abstract
Object goal navigation aims to steer an agent towards a target object based on the agent's observations. Designing effective visual representations of the observed scene is of pivotal importance in determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representations in navigation. VTNet is a highly effective structure that embodies two key properties for visual representations: first, the relationships among all the object instances in a scene are exploited; second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we develop a pre-training scheme to associate the visual representations with navigation signals and thus facilitate navigation policy learning. In a nutshell, VTNet embeds object and region features with…
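To make the two properties in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each detected object comes with a small feature vector and a pixel bounding box, appends the normalized box center and size to form a hypothetical "spatial-aware descriptor", and runs single-head scaled dot-product self-attention with identity projections so every object attends to every other. The function names, the four-number location encoding, and the toy inputs are all illustrative assumptions.

```python
import math


def spatial_descriptor(feature, box, image_size):
    """Append normalized box center and size to an object's feature vector.

    Assumption: location is encoded as (cx, cy, w, h) in [0, 1]; the paper's
    actual encoding may differ.
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    cx = (x1 + x2) / (2 * width)
    cy = (y1 + y2) / (2 * height)
    bw = (x2 - x1) / width
    bh = (y2 - y1) / height
    return list(feature) + [cx, cy, bw, bh]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(descriptors):
    """Single-head scaled dot-product self-attention.

    Queries, keys, and values are the descriptors themselves (identity
    projections), which keeps the sketch free of learned weights.
    """
    dim = len(descriptors[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in descriptors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in descriptors
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of all descriptors, one output vector per object.
        outputs.append(
            [sum(w * value[j] for w, value in zip(row, descriptors))
             for j in range(dim)]
        )
    return outputs, weights


# Toy scene: (feature vector, bounding box) per detected object.
objects = [
    ([1.0, 0.0], (10, 20, 50, 80)),
    ([0.0, 1.0], (200, 40, 260, 120)),
    ([1.0, 0.1], (30, 25, 70, 90)),
]
descriptors = [spatial_descriptor(f, b, (300, 300)) for f, b in objects]
attended, weights = self_attention(descriptors)
```

In this toy scene the first object attends more strongly to the third, which has a similar feature and a nearby box, than to the second. A real model would add learned query, key, and value projections, multiple heads, and the remaining transformer layers listed under Methods.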
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Residual Connection · Adam · Layer Normalization · Dense Connections
