Hierarchical Scoring with 3D Gaussian Splatting for Instance Image-Goal Navigation

Yijie Deng; Shuaihang Yuan; Geeta Chandra Raju Bethala; Anthony Tzes; Yu-Shen Liu; Yi Fang

arXiv:2506.07338·cs.CV·June 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a hierarchical scoring framework for instance image-goal navigation that efficiently identifies optimal viewpoints using semantic and geometric cues, improving performance and reducing redundancy in view sampling.

Contribution

The paper presents a novel hierarchical scoring method combining semantic and geometric cues for better viewpoint selection in IIN, leveraging 3D Gaussian splatting and CLIP.

Findings

01. Achieves state-of-the-art results on simulated benchmarks

02. Demonstrates effective real-world applicability

03. Reduces redundancy in view sampling

Abstract

Instance Image-Goal Navigation (IIN) requires autonomous agents to identify and navigate to a target object or location depicted in a reference image captured from any viewpoint. While recent methods leverage powerful novel view synthesis (NVS) techniques, such as three-dimensional Gaussian splatting (3DGS), they typically rely on randomly sampling multiple viewpoints or trajectories to ensure comprehensive coverage of discriminative visual cues. This approach, however, creates significant redundancy through overlapping image samples and lacks principled view selection, substantially increasing both rendering and comparison overhead. In this paper, we introduce a novel IIN framework with a hierarchical scoring paradigm that estimates optimal viewpoints for target matching. Our approach integrates cross-level semantic scoring, utilizing CLIP-derived relevancy fields to identify regions…
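The abstract is cut off before it details the scoring, so the following is only a minimal sketch of the general idea of a relevancy score: ranking scene elements by cosine similarity between stored features and a query embedding. The random arrays stand in for CLIP features, and the function names `cosine_relevancy` and `top_k_regions` are hypothetical; none of this is taken from the paper's implementation.

```python
import numpy as np

def cosine_relevancy(features, query):
    """Cosine similarity between each feature row and a query embedding."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return f @ q

def top_k_regions(scores, k):
    """Indices of the k highest-scoring rows, best first."""
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
# Assumption: 1000 scene elements with 16-dim features (stand-ins for CLIP features).
features = rng.normal(size=(1000, 16))
# Assumption: the goal-image embedding is a slightly perturbed copy of element 42.
query = features[42] + 0.01 * rng.normal(size=16)

scores = cosine_relevancy(features, query)
candidates = top_k_regions(scores, k=5)
print(candidates)
```

Restricting later, more expensive comparisons to the few top-ranked candidates is one way to avoid rendering and matching many randomly sampled views.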

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* S1: The limitations of previous work are clearly presented, so the proposed contributions are properly motivated.
* S2: The proposed method is clearly explained and yields high performance with a lower runtime than other methods.
* S3: Real-world experiments are conducted (in the appendix), which is appreciated.

Weaknesses

- W1: [Major] The proposed method requires an initial exploratory rollout to build the scene representation, which is an important limitation. The authors do evaluate their approach on partial scene representations, simulating an unfinished exploration of the scene by randomly pruning Gaussians. Unfortunately, this does not faithfully simulate incomplete exploration, in which whole parts of the scene would be unknown.
- W
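The distinction raised in W1 can be illustrated with a small sketch. It contrasts uniform random pruning with removing a whole spatial region, using random 3D points as stand-ins for Gaussian centers. The helper names `prune_random` and `prune_region`, the scene size, and the cut-off plane are all assumptions made for illustration; this is not the authors' experimental code.

```python
import numpy as np

def prune_random(points, keep_ratio, rng):
    """Keep each point independently with probability keep_ratio."""
    mask = rng.random(len(points)) < keep_ratio
    return points[mask]

def prune_region(points, axis, threshold):
    """Drop every point whose coordinate along `axis` exceeds `threshold`."""
    return points[points[:, axis] <= threshold]

rng = np.random.default_rng(0)
# Assumption: 10,000 centers spread uniformly over a 10 x 10 x 10 scene.
points = rng.uniform(0.0, 10.0, size=(10_000, 3))

random_subset = prune_random(points, keep_ratio=0.5, rng=rng)
region_subset = prune_region(points, axis=0, threshold=5.0)

# Random pruning thins the scene but still covers its full extent;
# region removal leaves half of the scene entirely unobserved.
print(random_subset[:, 0].max(), region_subset[:, 0].max())
```

Both subsets keep roughly half of the points, but only the region-based variant leaves part of the scene with no points at all, which is the situation the reviewer describes.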

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1) The IIN task is a valuable research topic, and the authors' aim of improving existing 3DGS-based IIN methods with a hierarchical scoring approach is reasonable.
2) The hierarchical scoring approach uses both high-level semantic alignment and fine-grained geometric matching to recognize the target area, which obviates the exhaustive or random sampling of the environment that existing methods require.
3) The experiments on both simulation and real-world benchma

Weaknesses

1) The memory cost of storing the CLIP feature Gaussian field for a large-scale IIN task may be too large. Table 3 appears to compare only the memory cost of 3DGS; is the CLIP feature field included?
2) Regarding time efficiency, does hierarchical scoring take more time than existing methods during inference? No comparison of inference time and memory cost is given.
3) In Sec. 4.5, why can the target object still be localized after so many Gaussians are deleted? Do

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Reframes IIN over 3DGS as a view selection problem with semantic→geometric two-level scoring rather than brute-force rendering. Reports SOTA on simulated IIN (HM3D/Habitat) and demonstrates real-world deployment (humanoid platform).

Weaknesses

1. Section 3.4.1, Local scoring for region selection: clarity and notation
a. Motivation and examples for ray selection. Could you add a short motivation and one concrete example of which kinds of rays are expected to receive high scores vs. low scores?
b. What are “ground-truth rays” and “ground-truth scores”?
c. Please clarify which tensor is the query and which are the key/value in the cross-attention (Eq. 5).
d. Case and notation consistency. In 3.4.1, there are mixed upper/lower-case usages f

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis