RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Visual Contextual Adaptation
Ming-Ming Yu, Yi Chen, Börje F. Karlsson, Wenjun Wu

TL;DR
RANGER is a monocular, zero-shot semantic navigation framework that uses visual context to improve target localization without relying on depth or pose data.
Contribution
It introduces a novel approach combining 3D reconstruction, vision-language models, and visual in-context learning to enhance navigation in complex environments.
Findings
Achieves competitive success rates on the HM3D benchmark.
Demonstrates superior visual in-context learning adaptability.
Operates effectively without prior 3D environment mapping.
Abstract
Efficient target localization and autonomous navigation in complex environments are fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on ground-truth depth and pose information, which restricts applicability in real-world scenarios; and (2) lack of visual in-context learning (VICL) capability to extract geometric and semantic priors from environmental context, as in a short traversal video. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong…
