What does really matter in image goal navigation?
Gianluca Monaci, Philippe Weinzaepfel, Christian Wolf

TL;DR
This paper investigates whether end-to-end reinforcement learning can effectively solve image goal navigation by analyzing architectural choices and transferability to realistic settings, highlighting the emergence of relative pose estimation.
Contribution
The study provides a comprehensive analysis of how architectural choices affect performance and transferability in image goal navigation, challenging the assumption that dedicated modules are necessary.
Findings
Success of recent methods is partly due to simulator shortcuts.
Capabilities can transfer to more realistic settings to some extent.
Navigation performance correlates with emergent relative pose estimation.
Abstract
Image goal navigation requires two different skills: firstly, core navigation skills, including the detection of free space and obstacles, and taking decisions based on an internal representation; and secondly, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods either rely on dedicated image-matching, or pre-training of computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved with end-to-end training of full agents with RL, as has been claimed by recent work. A positive answer would have impact beyond Embodied AI and allow training of relative pose estimation from reward for navigation alone. In this large experimental study we investigate the effect of architectural choices like late fusion, channel stacking, space-to-depth projections and cross-attention,…
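To make the architectural options named in the abstract more concrete, here is a minimal NumPy sketch of three ways to combine an observation with a goal image: channel stacking, a space-to-depth projection, and late fusion. It is an illustration under assumptions, not the authors' implementation: the function names, the (H, W, C) layout, the block size, and the toy `encoder` are all hypothetical, and cross-attention is left out for brevity.

```python
import numpy as np


def channel_stack(obs, goal):
    """Early fusion: concatenate observation and goal along the channel axis.

    Both inputs are assumed to be (H, W, C) arrays of identical shape.
    """
    assert obs.shape == goal.shape
    return np.concatenate([obs, goal], axis=-1)  # (H, W, 2C)


def space_to_depth(x, block):
    """Fold each block x block spatial patch into the channel dimension.

    (H, W, C) -> (H / block, W / block, block * block * C)
    """
    h, w, c = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two block axes together
    return x.reshape(h // block, w // block, block * block * c)


def late_fusion(obs, goal, encoder):
    """Late fusion: encode each image separately, then join the features."""
    return np.concatenate([encoder(obs), encoder(goal)], axis=-1)


if __name__ == "__main__":
    obs = np.zeros((4, 4, 3), dtype=np.float32)
    goal = np.ones((4, 4, 3), dtype=np.float32)

    stacked = channel_stack(obs, goal)
    print(stacked.shape)                     # (4, 4, 6)
    print(space_to_depth(stacked, 2).shape)  # (2, 2, 24)

    # Hypothetical stand-in for a learned encoder: global average pooling.
    pooled = late_fusion(obs, goal, lambda img: img.mean(axis=(0, 1)))
    print(pooled.shape)                      # (6,)
```

The practical difference is where the two images first interact: channel stacking lets the earliest layers compare observation and goal pixel by pixel, while late fusion only compares them after each has been compressed into a feature vector.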
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
