TL;DR
AnyImageNav introduces a geometry-based, training-free approach for precise image-goal navigation, significantly improving success rates and pose accuracy over prior methods.
Contribution
It presents a novel semantic-to-geometric cascade that enables exact 6-DoF pose recovery without training, advancing image-goal navigation precision.
Findings
Achieves 93.1% success on Gibson and 82.6% on HM3D datasets.
Recovers the goal pose with 0.27 m position error and 3.41° heading error on Gibson.
Outperforms prior methods with 5-10x better pose accuracy.
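The two pose metrics quoted above can be illustrated with a minimal sketch. The summary does not give the paper's exact metric definitions, so this assumes the common convention: position error is the Euclidean distance between estimated and goal positions, and heading error is the absolute yaw difference wrapped into [0°, 180°]. The function names are hypothetical.

```python
import math
from typing import Sequence


def position_error(p_est: Sequence[float], p_goal: Sequence[float]) -> float:
    """Euclidean distance between two positions (same units as the inputs)."""
    return math.dist(p_est, p_goal)


def heading_error_deg(yaw_est_deg: float, yaw_goal_deg: float) -> float:
    """Absolute yaw difference in degrees, wrapped so the result is in [0, 180]."""
    # Shift by 180, wrap into [0, 360), shift back: result lies in [-180, 180).
    diff = (yaw_est_deg - yaw_goal_deg + 180.0) % 360.0 - 180.0
    return abs(diff)


def within_radius(p_est: Sequence[float], p_goal: Sequence[float],
                  radius_m: float = 1.0) -> bool:
    """Coarse success check: did the agent stop within radius_m of the goal?"""
    return position_error(p_est, p_goal) <= radius_m
```

The wrap-around matters near 0°/360°: headings of 359° and 2° differ by 3°, not 357°.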
Abstract
Image Goal Navigation (ImageNav) is evaluated by a coarse success criterion (the agent must stop within 1 m of the target), which is sufficient for finding objects but falls short for downstream tasks such as grasping that require precise positioning. We introduce AnyImageNav, a training-free system that pushes ImageNav toward this more demanding setting. Our key insight is that the goal image can be treated as a geometric query: any photo of an object, a hallway, or a room corner can be registered to the agent's observations via dense pixel-level correspondences, enabling recovery of the exact 6-DoF camera pose. Our method realizes this through a semantic-to-geometric cascade: a semantic relevance signal guides exploration and acts as a proximity gate, invoking a 3D multi-view foundation model only when the current view is highly relevant to the goal image; the model then self-certifies…
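The proximity-gate idea in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the semantic relevance signal is a cosine similarity between image embeddings, and `gated_pose_query`, `estimate_pose`, and the `threshold` value are hypothetical stand-ins. It covers only the gating step, not what happens after the 3D model runs.

```python
import math
from typing import Any, Callable, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gated_pose_query(
    obs_embedding: Sequence[float],
    goal_embedding: Sequence[float],
    estimate_pose: Callable[[], Any],  # hypothetical: wraps the expensive 3D model
    threshold: float = 0.8,            # hypothetical gate value
) -> Tuple[float, Optional[Any]]:
    """Run the expensive geometric step only when the view looks relevant.

    Returns (relevance, pose), where pose is None if the gate stayed closed.
    """
    relevance = cosine_similarity(obs_embedding, goal_embedding)
    if relevance < threshold:
        # Gate closed: keep exploring, using relevance as a guidance signal.
        return relevance, None
    # Gate open: invoke the geometric model on the current view.
    return relevance, estimate_pose()
```

The point of the gate is cost: the cheap relevance score is computed on every step, while the heavier geometric model is called only for views that are likely to overlap the goal image.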
