LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding
Fusang Wang, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Fabien Moutarde

TL;DR
This paper introduces LESV, a novel 3D scene understanding framework that uses structured sparse voxel rasterization and foundation model alignment to improve open-vocabulary recognition accuracy.
Contribution
LESV leverages structured sparse voxel rasterization and foundation model alignment to address spatial and semantic ambiguities in open-vocabulary 3D scene understanding.
Findings
Achieves state-of-the-art results on open-vocabulary 3D object retrieval.
Excels in fine-grained query scenarios where previous registration methods struggle.
Provides a stable geometric foundation with monocular priors for deterministic feature registration.
Abstract
Recent advances in open-vocabulary 3D scene understanding rely heavily on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS.…
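To make the idea of deterministic, confidence-aware registration over disjoint voxels more concrete, here is a minimal sketch. It is not the paper's implementation: the function name `register_features`, the one-voxel-per-pixel assignment, and the use of a scalar per-pixel confidence as an averaging weight are all assumptions made for illustration. The only point it shows is that when each pixel maps to exactly one voxel, feature registration reduces to a weighted scatter-and-average with no probabilistic blending across overlapping primitives.

```python
def register_features(pixel_voxel_ids, pixel_features, pixel_confidences):
    """Confidence-weighted average of 2D features per voxel (illustrative).

    Assumption: each pixel is assigned to exactly one voxel id (for example
    the voxel its ray hits first); an id of -1 means the pixel hit nothing.
    """
    sums = {}     # voxel id -> confidence-weighted feature sum
    weights = {}  # voxel id -> accumulated confidence
    for vid, feat, conf in zip(pixel_voxel_ids, pixel_features, pixel_confidences):
        if vid < 0 or conf <= 0.0:
            continue  # skip background pixels and zero-confidence observations
        if vid not in sums:
            sums[vid] = [0.0] * len(feat)
            weights[vid] = 0.0
        for i, x in enumerate(feat):
            sums[vid][i] += conf * x
        weights[vid] += conf
    # Normalize each voxel's accumulated feature by its total confidence.
    return {vid: [s / weights[vid] for s in acc] for vid, acc in sums.items()}


# Hypothetical toy input: four pixels, two voxels, 2-D features.
voxel_features = register_features(
    pixel_voxel_ids=[0, 0, 1, -1],
    pixel_features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]],
    pixel_confidences=[0.75, 0.25, 1.0, 1.0],
)
print(voxel_features)
```

In this toy input, the two pixels that land in voxel 0 are blended according to their confidences, and the pixel with id -1 is ignored. A real pipeline would operate on high-dimensional vision-language features and many views, but the accumulation pattern would be the same under these assumptions.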
