VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training
Mohammad Nazeri, Junzhe Wang, Amirreza Payandeh, and Xuesu Xiao

TL;DR
VANP introduces a self-supervised vision-action pre-training method that enables robots to focus on navigation-relevant visual regions, reducing training time and data requirements compared to traditional supervised approaches.
Contribution
This work presents VANP, a novel self-supervised model that learns navigation-specific visual features using mutual information maximization, without relying on large labeled datasets.
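To make "mutual information maximization" concrete, here is a minimal sketch under stated assumptions. The summary does not say which objective VANP optimizes, so the sketch swaps in InfoNCE, a common contrastive lower bound on mutual information, and applies it to paired vision and action embeddings. The pairing itself, the function names `info_nce_loss` and `mi_lower_bound`, the temperature, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(vision_emb, action_emb, temperature=0.1):
    """InfoNCE loss for paired embeddings (row i of each input is a matched pair).

    Illustrative stand-in only: the actual VANP objective is not given in this summary.
    """
    v = [_normalize(row) for row in vision_emb]
    a = [_normalize(row) for row in action_emb]
    n = len(v)
    total = 0.0
    for i in range(n):
        # Similarity of vision sample i to every action sample in the batch.
        logits = [sum(x * y for x, y in zip(v[i], a[j])) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matched action (index i).
        total += log_denominator - logits[i]
    return total / n


def mi_lower_bound(loss, batch_size):
    """InfoNCE bound: I(vision; action) >= log(batch_size) - loss."""
    return math.log(batch_size) - loss


# Toy data (hypothetical): action embeddings are noisy copies of vision embeddings.
rng = random.Random(0)
vision = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
action = [[x + 0.01 * rng.gauss(0, 1) for x in row] for row in vision]
shuffled = action[1:] + action[:1]  # break the pairing by rotating the rows

aligned_loss = info_nce_loss(vision, action)
shuffled_loss = info_nce_loss(vision, shuffled)
print(aligned_loss < shuffled_loss)
```

Minimizing this loss pushes each visual embedding toward its own action embedding and away from the others in the batch, which tightens the bound on their mutual information. The bound cannot exceed the logarithm of the batch size, so larger batches allow a higher estimate.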
Findings
VANP matches the navigation performance of models trained end-to-end while using half the training time.
VANP is also comparable to models pre-trained with full supervision on ImageNet while using only 0.08% as much data.
Features learned by VANP align with human navigation intuition.
Abstract
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation. However, most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects -- not necessarily relevant to navigation and potentially misleading. Alternative approaches train specialized navigation models from scratch, requiring significant computation. On the other hand, self-supervised learning has revolutionized computer vision and natural language processing, but its application to robotic navigation remains underexplored due to the difficulty of defining effective self-supervision signals. Motivated by these observations, in this work, we propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP). Instead of detecting salient objects that are beneficial for…
Taxonomy
Topics: Robotics and Sensor-Based Localization · Robotics and Automated Systems
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Dropout · Softmax · Residual Connection · Dense Connections
