Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation

Won Shik Jang; Ue-Hwan Kim

arXiv:2603.09506 · cs.CV · March 19, 2026

TL;DR

Context-Nav introduces a geometry-grounded, viewpoint-aware 3D spatial reasoning approach for text-goal instance navigation, improving exploration and disambiguation without task-specific training.

Contribution

Context-Nav combines global exploration guided by dense text-image alignments with viewpoint-aware 3D spatial verification, achieving state-of-the-art results without task-specific training or fine-tuning.

Findings

1. State-of-the-art performance on the InstanceNav and CoIN-Bench datasets.

2. Encoding full captions into the value map improves exploration efficiency.

3. Viewpoint-aware 3D verification reduces incorrect stopping in navigation.
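
The viewpoint-aware verification in the last finding can be illustrated with a minimal sketch. It is not the authors' implementation: it simplifies the problem to a 2D top-down view, and the pose sampler, the `left_of`/`right_of` predicates, the `margin` threshold, and all function names are hypothetical choices made for illustration. The idea shown is only that a candidate is accepted when a single sampled observer pose satisfies every stated relation at once.

```python
import math

def to_observer_frame(point, observer_xy, yaw):
    """Express a world-frame 2D point in the observer frame (x forward, y left)."""
    dx, dy = point[0] - observer_xy[0], point[1] - observer_xy[1]
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * dx + s * dy, -s * dx + c * dy)

def left_of(target_obs, ref_obs, margin=0.1):
    """Target lies further left than the reference (assumed margin in meters)."""
    return target_obs[1] - ref_obs[1] > margin

def right_of(target_obs, ref_obs, margin=0.1):
    """Target lies further right than the reference."""
    return ref_obs[1] - target_obs[1] > margin

def sample_observer_poses(target, radius=2.0, n=8):
    """Hypothetical sampler: n poses on a circle around the target, each facing it."""
    poses = []
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        pos = (target[0] + radius * math.cos(angle),
               target[1] + radius * math.sin(angle))
        yaw = math.atan2(target[1] - pos[1], target[0] - pos[0])
        poses.append((pos, yaw))
    return poses

def verify_candidate(target, constraints, poses):
    """Accept if at least one pose satisfies all (relation, reference) pairs jointly."""
    for pos, yaw in poses:
        t = to_observer_frame(target, pos, yaw)
        if all(rel(t, to_observer_frame(ref, pos, yaw)) for rel, ref in constraints):
            return True
    return False

poses = sample_observer_poses((0.0, 0.0))
# Both references sit on the same side of the target: some viewpoint works.
consistent = verify_candidate(
    (0.0, 0.0), [(left_of, (1.0, 0.0)), (left_of, (1.0, 1.0))], poses)
# References sit on opposite sides: no single viewpoint puts the target left of both.
contradictory = verify_candidate(
    (0.0, 0.0), [(left_of, (1.0, 0.0)), (left_of, (-1.0, 0.0))], poses)
```

Requiring all relations to hold from the same pose is what lets the check reject a same-category distractor whose surroundings match each relation separately but never together.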

Abstract

Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav, which elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers, guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance…
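
The first step of the abstract, ranking frontiers with a caption-conditioned value map, might look roughly like the sketch below. Everything in it is an assumption for illustration: the grid-cell keys, the toy two-dimensional embeddings, cosine similarity as the alignment score, and max-fusion of repeated observations are not taken from the paper, and a real system would obtain the embeddings from a vision-language model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def update_value_map(value_map, observed_cells, caption_emb):
    """Score each observed cell against the full caption embedding.

    observed_cells maps a grid cell (row, col) to an image feature vector.
    Keeping the maximum score per cell is an assumed fusion rule.
    """
    for cell, feat in observed_cells.items():
        score = cosine(caption_emb, feat)
        value_map[cell] = max(value_map.get(cell, score), score)
    return value_map

def rank_frontiers(frontiers, value_map, default=0.0):
    """Order frontier cells from most to least consistent with the caption."""
    return sorted(frontiers, key=lambda c: value_map.get(c, default), reverse=True)

caption_emb = [1.0, 0.0]                      # toy embedding of the whole description
observed = {(0, 0): [1.0, 0.0],               # strongly aligned region
            (2, 3): [1.0, 1.0],               # partially aligned region
            (5, 5): [0.0, 1.0]}               # unrelated region
value_map = update_value_map({}, observed, caption_emb)
ranking = rank_frontiers([(5, 5), (2, 3), (0, 0)], value_map)
```

In this toy setup the agent would head for the frontier at `ranking[0]` next, so exploration follows the description as a whole rather than the first detection of the target category.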

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Robot Manipulation and Learning