TL;DR
Ilov3Splat is a new framework for open-vocabulary 3D scene understanding that combines Gaussian splatting with language-aligned features for instance-level recognition without manual annotations.
Contribution
It introduces a method that jointly optimizes scene geometry and semantic features, using language-aligned CLIP features and SAM masks, to enable language-driven 3D object selection and instance segmentation.
Findings
Outperforms prior methods in object selection and instance segmentation
Supports arbitrary object recognition based on natural language
Operates without category supervision or manual annotations
Abstract
We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched…
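The instance feature field described in the abstract is trained with a contrastive loss over SAM masks. The abstract does not give the exact formulation, so the following is only a minimal sketch of one common choice, an InfoNCE-style supervised contrastive loss in which features sampled from the same mask are pulled together and features from different masks are pushed apart. The function name `mask_contrastive_loss`, the `temperature` parameter, and the input layout are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def mask_contrastive_loss(features, mask_ids, temperature=0.1):
    """Hypothetical supervised contrastive loss over mask-labelled features.

    features: (N, D) float array of rendered instance features (assumed layout).
    mask_ids: (N,) int array giving the mask each feature was sampled from.
    """
    features = np.asarray(features, dtype=np.float64)
    mask_ids = np.asarray(mask_ids)

    # L2-normalize so dot products are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = f @ f.T / temperature

    n = len(mask_ids)
    not_self = ~np.eye(n, dtype=bool)
    # Positives: other samples that share the anchor's mask id.
    positives = (mask_ids[:, None] == mask_ids[None, :]) & not_self

    # Log-softmax over all samples except the anchor itself.
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each anchor that has any.
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


# Toy usage: four 2-D features, two per mask.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss = mask_contrastive_loss(feats, np.array([0, 0, 1, 1]))
```

In this sketch the loss is low when features that share a mask are already close in cosine similarity, and it grows when the mask labels cut across the feature clusters.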
