End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon
Guillaume Bono, Leonid Antsfeld, Boris Chidlovskii, Philippe Weinzaepfel, Christian Wolf

TL;DR
This paper introduces a novel approach for image goal navigation that leverages pretext tasks and a dual encoder model to improve visual correspondence understanding, achieving state-of-the-art results in complex, unseen environments.
Contribution
The authors propose a dual encoder with a large-capacity binocular ViT and a two-step pretext training process to enhance visual correspondence and navigation performance.
Findings
Significant improvements on ImageNav benchmarks.
State-of-the-art performance on Instance-ImageNav with varying camera parameters.
Emergence of correspondence solutions from training signals.
Abstract
Most recent work in goal-oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy that requires solving an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottlenecks in perception, extremely wide-baseline relative pose estimation and visibility prediction in complex…
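As an illustration of what "relative pose estimation" asks a model to predict, the following is a minimal sketch of computing the ground-truth relative pose between an observation camera and a goal camera. It is not the authors' implementation: the planar (x, y, yaw) pose parameterization and the helper names `pose_matrix` and `relative_pose` are assumptions made for this example only.

```python
import numpy as np

def pose_matrix(x, y, yaw):
    """Build a 3x3 homogeneous transform for a planar pose (hypothetical helper)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])

def relative_pose(T_obs, T_goal):
    """Express the goal camera pose in the observation camera's frame."""
    return np.linalg.inv(T_obs) @ T_goal

# Observation camera at (1, 2) facing +y; goal camera at (1, 5) facing -x.
T_obs = pose_matrix(1.0, 2.0, np.pi / 2)
T_goal = pose_matrix(1.0, 5.0, np.pi)

T_rel = relative_pose(T_obs, T_goal)
rel_xy = T_rel[:2, 2]                            # translation in the observer's frame
rel_yaw = np.arctan2(T_rel[1, 0], T_rel[0, 0])   # relative heading in radians
```

In this toy setup the goal camera lies 3 units straight ahead of the observer and is rotated by 90 degrees relative to it. A pose-regression pretext task would train a network to recover such a quantity from the two images alone.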
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
