Hierarchical Scoring with 3D Gaussian Splatting for Instance Image-Goal Navigation
Yijie Deng, Shuaihang Yuan, Geeta Chandra Raju Bethala, Anthony Tzes, Yu-Shen Liu, Yi Fang

TL;DR
This paper introduces a hierarchical scoring framework for instance image-goal navigation that efficiently identifies optimal viewpoints using semantic and geometric cues, improving performance and reducing redundancy in view sampling.
Contribution
The paper presents a novel hierarchical scoring method combining semantic and geometric cues for better viewpoint selection in IIN, leveraging 3D Gaussian splatting and CLIP.
Findings
Achieves state-of-the-art results on simulated benchmarks
Demonstrates effective real-world applicability
Reduces redundancy in view sampling
Abstract
Instance Image-Goal Navigation (IIN) requires autonomous agents to identify and navigate to a target object or location depicted in a reference image captured from any viewpoint. While recent methods leverage powerful novel view synthesis (NVS) techniques, such as three-dimensional Gaussian splatting (3DGS), they typically rely on randomly sampling multiple viewpoints or trajectories to ensure comprehensive coverage of discriminative visual cues. This approach, however, creates significant redundancy through overlapping image samples and lacks principled view selection, substantially increasing both rendering and comparison overhead. In this paper, we introduce a novel IIN framework with a hierarchical scoring paradigm that estimates optimal viewpoints for target matching. Our approach integrates cross-level semantic scoring, utilizing CLIP-derived relevancy fields to identify regions…
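The coarse-to-fine idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`select_view`, `toy_fine_score`), the use of cosine similarity as the cheap semantic score, the `top_k` shortlist size, and all feature values are assumptions made for illustration. The point is only that the expensive fine-grained score is evaluated on a shortlist rather than on every candidate view.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_view(goal_feat, candidates, fine_score, top_k=2):
    """Two-level selection (hypothetical helper, not from the paper).

    candidates: list of (view_id, feature) pairs; the features stand in
                for CLIP-like embeddings.
    fine_score: callable view_id -> float, standing in for an expensive
                geometric matching step.
    """
    # Level 1: cheap semantic ranking over all candidates.
    ranked = sorted(candidates, key=lambda c: cosine(goal_feat, c[1]), reverse=True)
    shortlist = [view_id for view_id, _ in ranked[:top_k]]
    # Level 2: expensive scoring only on the shortlist.
    scores = {view_id: fine_score(view_id) for view_id in shortlist}
    best = max(scores, key=scores.get)
    return best, shortlist


# Toy data: every value here is made up for the example.
goal = [1.0, 0.0]
candidates = [
    ("a", [0.8, 0.3]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 0.2]),
    ("d", [-1.0, 0.0]),
]
fine_calls = []  # records which views paid for the expensive step


def toy_fine_score(view_id):
    fine_calls.append(view_id)
    return {"a": 0.8, "c": 0.3}.get(view_id, 0.0)


best, shortlist = select_view(goal, candidates, toy_fine_score, top_k=2)
print(best, shortlist, len(fine_calls))
```

In this toy run the semantic level ranks view "c" first, the fine level overrides it in favor of "a", and only two of the four candidates are scored by the expensive step.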
Peer Reviews
Decision · Submitted to ICLR 2026
* S1: The limitations of previous work are clearly presented and the proposed contributions are thus properly motivated.
* S2: The proposed method is clearly explained, and leads to high performance and more efficient runtime than other methods.
* S3: Real-world experiments are conducted (in appendix), which is appreciated.
- W1: [Major] The proposed method requires a first exploratory rollout to build the scene representation, which is a significant limitation. The authors do evaluate their approach on partial scene representations, simulating an unfinished exploration of the scene by randomly pruning Gaussians. Unfortunately, this does not faithfully simulate incomplete exploration, in which whole parts of the scene would be unknown.
- W
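The distinction raised in W1 can be made concrete with a short sketch. It is an illustration only: the helper names (`prune_random`, `prune_region`, `occupied_cells`), the 10 × 10 grid of 2D points standing in for Gaussian centers, and the 50% keep ratio are all assumptions, not the paper's protocol. Both pruning schemes keep the same number of points, but only region-based removal leaves whole parts of the scene empty.

```python
import random


def prune_random(points, keep_ratio, seed=0):
    """Keep a uniformly random subset of the points."""
    rng = random.Random(seed)
    n_keep = round(len(points) * keep_ratio)
    return rng.sample(points, n_keep)


def prune_region(points, x_max):
    """Keep only points with x < x_max, as if the rest was never explored."""
    return [p for p in points if p[0] < x_max]


def occupied_cells(points, cell=1.0):
    """Set of grid cells that still contain at least one point."""
    return {(int(p[0] // cell), int(p[1] // cell)) for p in points}


# Toy scene: 10 x 10 cells, 20 points per cell (stand-ins for Gaussian centers).
points = [
    (i + k / 20.0, j + 0.5)
    for i in range(10)
    for j in range(10)
    for k in range(20)
]

random_kept = prune_random(points, keep_ratio=0.5)
region_kept = prune_region(points, x_max=5.0)

print(len(random_kept), len(occupied_cells(random_kept)))
print(len(region_kept), len(occupied_cells(region_kept)))
```

With equal point budgets, the region-pruned scene covers exactly half of the cells, while the randomly pruned scene stays spread across far more of them, which is why random pruning is a weaker stress test than truly unexplored regions.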
1) The IIN task is a valuable research topic, and the authors' aim of improving the existing 3DGS-based IIN method with a hierarchical scoring approach is reasonable.
2) The hierarchical scoring approach uses both high-level semantic alignment and fine-grained geometric matching to recognize the target area, which obviates the exhaustive or random sampling of the environment required by existing methods.
3) The experiments on both simulation and real-world benchma
1) The memory cost of storing the CLIP feature Gaussian field for a large-scale IIN task may be too large. Table 3 appears to compare only the memory cost of 3DGS, so is the CLIP feature field included?
2) Regarding time efficiency, does the hierarchical scoring take more time than existing methods during inference? No comparison of inference time or memory cost is provided.
3) In Sec. 4.5, I am curious why the target object can still be localized after so many Gaussians are deleted. Do
Reframes IIN over 3DGS as a view selection problem with semantic→geometric two-level scoring rather than brute-force rendering. Reports SOTA on simulated IIN (HM3D/Habitat) and demonstrates real-world deployment (humanoid platform).
1. Section 3.4.1 Local scoring for region selection: clarity and notation
a. Motivation/examples for ray selection. Could you add a short motivation and one concrete example of what kinds of rays are expected to receive high scores vs. low scores?
b. What are "ground-truth rays" and "ground-truth scores"?
c. Please clarify which tensor is the query and which are the key/value in the cross-attention (Eq. 5).
d. Case and notation consistency. In 3.4.1, there are mixed upper/lower-case usages f
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
