LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

Fusang Wang; Nathan Piasco; Moussab Bennehar; Luis Roldão; Dzmitry Tsishkou; Fabien Moutarde

arXiv:2604.01388·cs.CV·April 3, 2026

TL;DR

This paper introduces LESV, a novel 3D scene understanding framework that uses structured sparse voxel rasterization and foundation model alignment to improve open-vocabulary recognition accuracy.

Contribution

LESV leverages structured sparse voxel rasterization and foundation model alignment to address spatial and semantic ambiguities in open-vocabulary 3D scene understanding.

Findings

1. Achieves state-of-the-art results on open-vocabulary 3D object retrieval.

2. Excels in fine-grained query scenarios where previous registration methods struggle.

3. Provides a stable geometric foundation, regularized with monocular depth and normal priors, that enables deterministic feature registration.

Abstract

Recent advancements in open-vocabulary 3D scene understanding rely heavily on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS.…
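
To make the idea of deterministic, confidence-aware registration on a disjoint voxel grid concrete, here is a minimal sketch. It is not the authors' implementation: the function name `register_features`, the assumption that each pixel maps to exactly one voxel index, and the use of a confidence-weighted average followed by L2 normalization are all illustrative choices that the abstract does not specify.

```python
import numpy as np

def register_features(voxel_ids, pixel_feats, confidences, num_voxels):
    """Fuse per-pixel features into voxels by confidence-weighted averaging.

    Hypothetical inputs (assumptions, not from the paper):
      voxel_ids   : (N,) int array, the single voxel each pixel is assigned to
      pixel_feats : (N, D) float array of vision-language features
      confidences : (N,) non-negative float array of per-pixel weights
    Returns (fused, seen): (num_voxels, D) unit-norm features and a bool mask
    marking voxels that received at least one positively weighted pixel.
    """
    dim = pixel_feats.shape[1]
    feat_sum = np.zeros((num_voxels, dim))
    weight_sum = np.zeros(num_voxels)

    # np.add.at accumulates correctly even when voxel_ids contains repeats.
    np.add.at(feat_sum, voxel_ids, pixel_feats * confidences[:, None])
    np.add.at(weight_sum, voxel_ids, confidences)

    seen = weight_sum > 0
    fused = np.zeros_like(feat_sum)
    fused[seen] = feat_sum[seen] / weight_sum[seen][:, None]

    # L2-normalize so features can be compared to text embeddings by cosine
    # similarity; voxels with a zero-norm feature are left as zeros.
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    fused = np.divide(fused, norms, out=np.zeros_like(fused), where=norms > 0)
    return fused, seen
```

Because every pixel contributes to one voxel only, the result does not depend on sampling or on blending weights shared between overlapping primitives, which is the contrast with 3DGS-style registration that the abstract draws.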
